Language agents reduce the risk of existential catastrophe

AI & SOCIETY (Open Forum)

Abstract

Recent advances in natural-language processing have given rise to a new kind of AI architecture: the language agent. By repeatedly calling an LLM to perform a variety of cognitive tasks, language agents are able to function autonomously to pursue goals specified in natural language and stored in a human-readable format. Because of their architecture, language agents exhibit behavior that is predictable according to the laws of folk psychology: they function as though they have desires and beliefs, and then make and update plans to pursue their desires given their beliefs. We argue that the rise of language agents significantly reduces the probability of an existential catastrophe due to loss of control over an AGI. This is because the probability of such an existential catastrophe is proportional to the difficulty of aligning AGI systems, and language agents significantly reduce that difficulty. In particular, language agents help to resolve three important issues related to aligning AIs: reward misspecification, goal misgeneralization, and uninterpretability.
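
To make the architecture described in the abstract concrete, here is a minimal sketch, in Python, of the kind of loop a language agent runs. It is an illustration rather than code from the paper or from any particular project such as Auto-GPT or BabyAGI; the `call_llm` function, the prompts, and the class structure are all assumptions introduced for the example.

```python
# Minimal illustrative sketch of a language agent loop (not code from the paper,
# Auto-GPT, or BabyAGI). call_llm is a hypothetical placeholder for any
# text-in, text-out LLM API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Substitute your preferred LLM API here.")


class LanguageAgent:
    def __init__(self, goal: str):
        self.goal = goal                 # the agent's desire, a natural language sentence
        self.beliefs: list[str] = []     # beliefs, stored as natural language sentences
        self.plan: str = ""              # the current plan, also human-readable

    def observe(self, observation: str) -> None:
        # Record a new observation in the human-readable belief store.
        self.beliefs.append(observation)

    def step(self) -> str:
        # Call the LLM once to revise the plan in light of the goal and beliefs ...
        self.plan = call_llm(
            f"Goal: {self.goal}\nBeliefs: {'; '.join(self.beliefs)}\n"
            "Write a short step-by-step plan for achieving the goal."
        )
        # ... and once more to extract the next concrete action from that plan.
        return call_llm(
            f"Plan: {self.plan}\nState the single next action in one sentence."
        )
```

Because the goal, beliefs, and plan are ordinary sentences, a human overseer can read them directly and edit them if they drift from what was intended; this is the human-readable storage the abstract refers to.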

Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

  1. The phenomenon we call reward misspecification is sometimes also called “reward hacking” (e.g. by Amodei et al. 2016), “specification gaming” (e.g. by Shah et al. 2022), or, in the context of supervised learning, “outer misalignment.”

  2. As we understand it, the problem of goal misgeneralization is similar to the problem of “inner misalignment” (Hubinger et al. 2021).

  3. Hubinger et al. (2021) call this “side-effect alignment.”

  4. See Schroeder (2004) for further discussion of how reward-based learning produces new intrinsic desires for reliable means to one’s goals.

  5. Similar remarks apply to the Decision Transformer architecture developed by Chen et al. (2021).

  6. See Metz (2016).

  7. For more on interpretability in the setting of reinforcement learning, see Glanois et al. (2022).

  8. While we have been careful in this initial exposition to qualify our attributions of mental states like belief and desire to language agents, for the sake of brevity we will omit these qualifications in what follows. It is worth emphasizing, however, that none of our arguments depend on language agents having bona fide mental states as opposed to merely behaving as though they do. That said, we are sympathetic to the idea that language agents may have bona fide beliefs and desires—see our arguments in Goldstein and Kirk-Giannini (2023). Two particularly interesting questions here are whether language agents can respond to reasons and whether, following Schroeder (2004), desires must be systematically related to reward-based learning in ways that language agents cannot imitate.

  9. Some might worry that, because language agents store their beliefs and desires as natural language sentences, their performance will be limited by their inability to reason using partial beliefs (subjective probabilities) and utilities. While we are not aware of work which adapts language agents to reason using partial beliefs and utilities, the same kind of process which is used by Park et al. (2023) to assign numerical importance scores to language agents’ beliefs could in principle be used to assign subjective probabilities to sentences and utilities to outcomes (see the sketch following these notes). We believe this is an interesting avenue for future research. Thanks to an anonymous referee for raising this issue.

  10. Project available at https://github.com/Significant-Gravitas/Auto-GPT.

  11. Project available at https://github.com/yoheinakajima/babyagi.

  12. See Wang et al. (2023).

  13. For more on the commonsense reasoning ability of language models, see Trinh and Le (2019).

  14. See the recent successes of Voyager at completing tasks in Minecraft (Wang et al. 2023).

  15. See Bubeck et al. (2023) for discussion.

  16. The safety of language agents could also be improved by creating multiple instances of the underlying LLM. In this setting, an action would be executed only if (for example) all ten instances recommended the same plan for achieving the goal.

  17. For research in this direction, see Voyager’s skill library in Wang et al. (2023).

  18. Thanks to an anonymous referee for raising these concerns.

  19. Project available at https://segment-anything.com/.

  20. See https://yoshuabengio.org/2023/05/07/ai-scientists-safe-and-useful-ai/ for a recent proposal about how to use AI without developing agents.
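
As a concrete illustration of the suggestion in note 9, the following hypothetical sketch shows how subjective probabilities and utilities might be elicited from an LLM and attached to natural language sentences, in the same spirit as the numerical importance scores of Park et al. (2023). It is not an implementation from that paper; `call_llm`, the prompts, and the scoring scales are assumptions introduced for the example.

```python
# Hypothetical sketch for note 9: eliciting subjective probabilities and utilities
# for natural language sentences from an LLM. call_llm is a placeholder for any
# text-in, text-out LLM API; the prompts and scales are illustrative assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Substitute your preferred LLM API here.")


def credence(sentence: str) -> float:
    # Ask for a probability between 0 and 1 and clamp the parsed answer to that range.
    answer = call_llm(
        "How likely is it that the following is true? Answer with a single number "
        f"between 0 and 1.\nClaim: {sentence}"
    )
    return min(max(float(answer), 0.0), 1.0)


def utility(outcome: str, goal: str) -> float:
    # Ask for a score from -10 (very bad for the goal) to 10 (very good) and clamp it.
    answer = call_llm(
        f"Goal: {goal}\nOutcome: {outcome}\n"
        "Rate how good this outcome is for the goal on a scale from -10 to 10. "
        "Answer with a single number."
    )
    return min(max(float(answer), -10.0), 10.0)


def expected_utility(outcomes: list[str], goal: str) -> float:
    # Standard expected utility: sum over outcomes of credence(outcome) * utility(outcome).
    return sum(credence(o) * utility(o, goal) for o in outcomes)
```

A planning step could then compare candidate plans by the expected utility of their predicted outcomes while keeping every sentence involved open to human inspection.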

References

  • Amodei D, Clark J (2016) Faulty reward functions in the wild. Blog Post. https://blog.openai.com/faulty-reward-functions/

  • Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D (2016) Concrete problems in AI safety. Manuscript. https://arxiv.org/abs/1606.06565

  • Bostrom N (2014) Superintelligence: paths, dangers, strategies. Oxford University Press

  • Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, Lee P, Lee YT, Li Y, Lundberg S, Nori H, Palangi H, Ribeiro MT, Zhang Y (2023) Sparks of artificial general intelligence: early experiments with GPT-4. Manuscript. https://arxiv.org/abs/2303.12712

  • Burns C, Ye H, Klein D, Steinhardt J (2022) Discovering latent knowledge in language models without supervision. Manuscript. https://arxiv.org/abs/2212.03827

  • Cappelen H, Dever J (2021) Making AI intelligible. Oxford University Press

  • Carlsmith J (2021) Is power-seeking AI an existential risk? Manuscript. https://arxiv.org/abs/2206.13353

  • Chen L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, Abbeel P, Srinivas A, Mordatch I (2021) Decision transformer: reinforcement learning via sequence modeling. NeurIPS. 34:15084–15097

  • Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D (2017) Deep reinforcement learning from human preferences. NeurIPS. 30:4299–4307

  • Doshi-Velez F, Kortz M, Budish R, Bavitz C, Gershman S, O'Brien D, Scott K, Schieber S, Waldo J, Weinberger D, Weller A, Wood A (2017) Accountability of AI under the law: the role of explanation. Manuscript. https://arxiv.org/abs/1711.01134

  • Driess D, Xia F, Sajjadi MSM, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu T, Huang W, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Florence P (2023) PaLM-E: an embodied multimodal language model. Manuscript. https://arxiv.org/abs/2303.03378

  • Glanois C, Weng P, Zimmer M, Li D, Yang T, Hao J, Liu W (2022) A survey on interpretable reinforcement learning. Manuscript. https://arxiv.org/abs/2112.13112

  • Goldstein S, Kirk-Giannini CD (2023) AI wellbeing. Manuscript. https://philpapers.org/archive/GOLAWE-4.pdf

  • Hubinger E, van Merwijk C, Mikulik V, Skalse J, Garrabrant S (2021) Risks from learned optimization in advanced machine learning systems. Manuscript. https://arxiv.org/pdf/1906.01820.pdf

  • Krakovna V, Uesato J, Mikulik V, Rahtz M, Everitt T, Kumar R, Kenton Z, Leike J, Legg S (2020) Specification gaming: the flip side of AI ingenuity. Blog Post. https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity

  • Langosco L, Koch J, Sharkey L, Pfau J, Krueger D (2022) Goal misgeneralization in deep reinforcement learning. In: Proceedings of the 39th International Conference on Machine Learning, pp 12004–12019

  • Metz C (2016) In two moves, AlphaGo and Lee Sedol redefined the future. Wired 16 March, 2016. https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/

  • Olah C, Cammarata N, Schubert L, Goh G, Petrov M, Carter S (2020) Zoom in: an introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in/

  • Omohundro S (2008) The basic AI drives. In: Wang P, Goertzel B, Franklin S (eds) Proceedings of the first conference on artificial general intelligence. IOS Press, pp 483–492

  • Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano PF, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback. NeurIPS. 35:27730–27744

  • Park JS, O'Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS (2023) Generative agents: interactive simulacra of human behavior. Manuscript. https://arxiv.org/abs/2304.03442

  • Perez E, Ringer S, Lukošiūtė K, Nguyen K, Chen E, Heiner S, Pettit C, Olsson C, Kundu S, Kadavath S, Jones A, Chen A, Mann B, Israel B, Seethor B, McKinnon C, Olah C, Yan D, Kaplan J (2022) Discovering language model behaviors with model-written evaluations. Manuscript. https://arxiv.org/abs/2212.09251

  • Popov I, Heess N, Lillicrap T, Hafner R, Barth-Maron G, Vecerik M, Lampe T, Tassa Y, Erez T, Riedmiller M (2017) Data-efficient deep reinforcement learning for dexterous manipulation. Manuscript. https://arxiv.org/abs/1704.03073

  • Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron G, Gimenez M, Sulsky Y, Kay J, Springenberg JT, Eccles T, Bruce J, Razavi A, Edwards A, Heess N, Chen Y, Hadsell R, Vinyals O, Bordbar M, and de Freitas N (2022) A generalist agent. Manuscript. https://arxiv.org/abs/2205.06175

  • Rudner TG, Toner H (2021) Key concepts in AI safety: interpretability in machine learning. Center for Security and Emerging Technology Issue Brief

  • Schroeder T (2004) Three faces of desire. Oxford University Press

  • Shah R, Varma V, Kumar R, Phuong M, Krakovna V, Uesato J, Kenton Z (2022) Goal misgeneralization: why correct specifications aren't enough for correct goals. Manuscript. https://arxiv.org/abs/2210.01790

  • Trinh TH, Le QV (2019) Do language models have common sense? Manuscript. https://openreview.net/pdf?id=rkgfWh0qKX

  • Turpin M, Michael J, Perez E, Bowman SR (2023) Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Manuscript. https://arxiv.org/abs/2305.04388

  • Wang G, Xie Y, Jiang Y, Mandlekar A, Xiao C, Zhu Y, Fan L, Anandkumar A (2023) Voyager: an open-ended embodied agent with large language models. Manuscript. https://arxiv.org/abs/2305.16291

Funding

This research was funded by The Center for AI Safety.

Author information

Corresponding author

Correspondence to Cameron Domenico Kirk-Giannini.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Cite this article

Goldstein, S., Kirk-Giannini, C.D. Language agents reduce the risk of existential catastrophe. AI & Soc (2023). https://doi.org/10.1007/s00146-023-01748-4
