
What Is Waiting for Us at the End? Inherent Biases of Game Story Endings in Large Language Models

  • Conference paper
Interactive Storytelling (ICIDS 2023)


This study investigates biases present in large language models (LLMs) when they are used for narrative tasks, specifically game story generation and story ending classification. Our experiment uses popular LLMs, including GPT-3.5, GPT-4, and Llama 2, to generate game stories and to classify their endings into three categories: positive, negative, and neutral. Our analysis reveals a notable bias towards positive-ending stories in the LLMs under examination. Moreover, we observe that GPT-4 and Llama 2 tend to classify stories into uninstructed categories, underscoring the importance of carefully designing downstream systems that consume LLM-generated outputs. These findings provide groundwork for systems that incorporate LLMs in game story generation and classification, and they emphasize the need for vigilance in addressing biases and improving system performance. By acknowledging and rectifying these biases, we can build fairer and more accurate applications of LLMs in narrative-based tasks.
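The classification setup described above (three instructed labels, with GPT-4 and Llama 2 sometimes returning uninstructed ones) implies that any downstream tally must tolerate out-of-vocabulary labels. A minimal sketch of such a tally, using hypothetical model outputs rather than data from the paper:

```python
from collections import Counter

# The three instructed categories from the experiment described above.
INSTRUCTED = {"positive", "negative", "neutral"}

def tally_endings(labels):
    """Count ending labels, routing any uninstructed category to 'other'.

    `labels` stands in for raw classification outputs from an LLM;
    the specific values used below are hypothetical examples.
    """
    counts = Counter()
    for label in labels:
        label = label.strip().lower()
        counts[label if label in INSTRUCTED else "other"] += 1
    return dict(counts)

# Hypothetical outputs: "bittersweet" is an uninstructed category.
outputs = ["positive", "Positive", "negative", "bittersweet", "neutral"]
print(tally_endings(outputs))
# {'positive': 2, 'negative': 1, 'other': 1, 'neutral': 1}
```

Bucketing unexpected labels into an explicit "other" category, rather than dropping them, is what makes the uninstructed-category behaviour visible in the counts.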



  1. Converting a raw text string into a key-value object in memory.

  2. As the temperature increases, the output from the model becomes more stochastic. The possible value range for ChatGPT is from 0 to 2, where 1 is the default value.

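The parsing step in footnote 1 (turning a raw text string into a key-value object) can be sketched as extracting and decoding a JSON span from the model's reply. This helper is a hypothetical illustration, not code from the paper; it assumes the model was prompted to include a JSON object in its response:

```python
import json

def parse_model_output(raw: str) -> dict:
    """Convert a raw text string from the model into a key-value object.

    LLM responses often wrap JSON in prose or code fences, so we extract
    the outermost {...} span before parsing.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])

raw = 'Here is the result:\n{"ending": "positive", "confidence": "high"}'
print(parse_model_output(raw))
# {'ending': 'positive', 'confidence': 'high'}
```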
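The temperature parameter in footnote 2 can be illustrated with the standard softmax-with-temperature formulation: logits are divided by the temperature before normalization, so higher temperatures flatten the distribution and make sampling more stochastic. A self-contained sketch (the logit values are arbitrary examples):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.

    Higher temperature flattens the distribution (more stochastic output);
    lower temperature sharpens it towards the top token.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low_t = softmax_with_temperature(logits, 0.5)   # sharper distribution
high_t = softmax_with_temperature(logits, 2.0)  # flatter distribution

print(low_t[0] > high_t[0])  # True: low temperature concentrates probability
```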




Author information

Corresponding author

Correspondence to Pittawat Taveekitworachai.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Taveekitworachai, P. et al. (2023). What Is Waiting for Us at the End? Inherent Biases of Game Story Endings in Large Language Models. In: Holloway-Attaway, L., Murray, J.T. (eds) Interactive Storytelling. ICIDS 2023. Lecture Notes in Computer Science, vol 14384. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47657-0

  • Online ISBN: 978-3-031-47658-7

  • eBook Packages: Computer Science; Computer Science (R0)
