Abstract
Recent advances in natural-language processing have given rise to a new kind of AI architecture: the language agent. By repeatedly calling an LLM to perform a variety of cognitive tasks, language agents are able to function autonomously to pursue goals specified in natural language and stored in a human-readable format. Because of their architecture, language agents exhibit behavior that is predictable according to the laws of folk psychology: they function as though they have desires and beliefs, and then make and update plans to pursue their desires given their beliefs. We argue that the rise of language agents significantly reduces the probability of an existential catastrophe due to loss of control over an AGI. This is because the probability of such an existential catastrophe is proportional to the difficulty of aligning AGI systems, and language agents significantly reduce that difficulty. In particular, language agents help to resolve three important issues related to aligning AIs: reward misspecification, goal misgeneralization, and uninterpretability.
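The agent architecture described above can be sketched as a simple loop. This is a minimal illustration only, not any particular system's implementation; `call_llm` is a hypothetical stand-in for an LLM completion API, here stubbed with a fixed reply.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return "1. Check calendar\n2. Draft email"

def agent_step(goal: str, beliefs: list[str], plan: list[str]) -> list[str]:
    """One iteration: re-plan from the goal and current beliefs.

    Goals and beliefs are stored as human-readable natural-language
    strings, so the agent's working "mental state" can be inspected directly.
    """
    prompt = (
        f"Goal: {goal}\n"
        "Beliefs:\n" + "\n".join(f"- {b}" for b in beliefs) + "\n"
        f"Current plan: {plan}\n"
        "Produce an updated numbered plan."
    )
    response = call_llm(prompt)
    # Each numbered line of the reply becomes a plan step.
    return [line.strip() for line in response.splitlines() if line.strip()]

beliefs = ["The meeting is on Friday.", "Alice has not replied yet."]
plan = agent_step("Schedule the project meeting", beliefs, [])
```

Because the goal, beliefs, and plan are all plain strings, a human can read or edit the agent's state at any point in the loop.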
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Notes
As we understand it, the problem of goal misgeneralization is similar to the problem of “inner misalignment” (Hubinger et al. 2021).
Hubinger et al. (2021) call this “side-effect alignment.”
See Schroeder (2004) for further discussion of how reward-based learning produces new intrinsic desires for reliable means to one’s goals.
Similar remarks apply to the Decision Transformer architecture developed by Chen et al. (2021).
See Metz (2016).
For more on interpretability in the setting of reinforcement learning, see Glanois et al. (2022).
While we have been careful in this initial exposition to qualify our attributions of mental states like belief and desire to language agents, for the sake of brevity we will omit these qualifications in what follows. It is worth emphasizing, however, that none of our arguments depend on language agents having bona fide mental states as opposed to merely behaving as though they do. That said, we are sympathetic to the idea that language agents may have bona fide beliefs and desires—see our arguments in Goldstein and Kirk-Giannini (2023). Two particularly interesting questions here are whether language agents can respond to reasons and whether, following Schroeder (2004), desires must be systematically related to reward-based learning in ways that language agents cannot imitate.
Some might worry that, because language agents store their beliefs and desires as natural-language sentences, their performance will be limited by their inability to reason using partial beliefs (subjective probabilities) and utilities. While we are not aware of work which adapts language agents to reason with partial beliefs and utilities, the same kind of process used by Park et al. (2023) to assign numerical importance scores to language agents' beliefs could in principle be used to assign subjective probabilities to sentences and utilities to outcomes. We believe this is an interesting avenue for future research. Thanks to an anonymous referee for raising this issue.
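A speculative sketch of this suggestion: just as Park et al. (2023) prompt the model for a numerical importance score, one could prompt it for a probability attached to a stored belief sentence. `call_llm` is a hypothetical stand-in for an LLM API, stubbed here with a fixed numeric reply.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return "0.8"

def credence(belief: str) -> float:
    """Elicit a subjective probability in [0, 1] for a belief sentence."""
    prompt = (
        "On a scale from 0 (certainly false) to 1 (certainly true), "
        f"how likely is the following statement?\n{belief}\n"
        "Answer with a single number."
    )
    value = float(call_llm(prompt))
    # Clamp malformed answers into the unit interval.
    return min(max(value, 0.0), 1.0)

p = credence("It will rain tomorrow.")
```

The same elicitation pattern could assign utilities to outcome descriptions, giving the agent the raw material for expected-utility-style reasoning over its natural-language state.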
Project available at https://github.com/Significant-Gravitas/Auto-GPT.
Project available at https://github.com/yoheinakajima/babyagi.
See Wang et al. (2023).
For more on the commonsense reasoning ability of language models, see Trinh and Le (2019).
See the recent successes of Voyager at completing tasks in Minecraft (Wang et al. 2023).
See Bubeck et al. (2023) for discussion.
The safety of language agents could also be improved by creating multiple instances of the underlying LLM. In this setting, an action would only happen if (for example) all ten instances recommended the same plan for achieving the goal.
For research in this direction, see Voyager’s skill library in Wang et al. (2023).
Thanks to an anonymous referee for raising these concerns.
Project available at https://segment-anything.com/.
See https://yoshuabengio.org/2023/05/07/ai-scientists-safe-and-useful-ai/ for a recent proposal about how to use AI without developing agents.
References
Amodei D, Clark J (2016) Faulty reward functions in the wild. Blog Post. https://blog.openai.com/faulty-reward-functions/
Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D (2016) Concrete problems in AI safety. Manuscript. https://arxiv.org/abs/1606.06565
Bostrom N (2014) Superintelligence: paths, dangers, strategies. Oxford University Press
Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, Lee P, Lee YT, Li Y, Lundberg S, Nori H, Palangi H, Ribeiro MT, Zhang Y (2023) Sparks of artificial general intelligence: early experiments with GPT-4. Manuscript. https://arxiv.org/abs/2303.12712
Burns C, Ye H, Klein D, Steinhardt J (2022) Discovering latent knowledge in language models without supervision. Manuscript. https://arxiv.org/abs/2212.03827
Cappelen H, Dever J (2021) Making AI intelligible. Oxford University Press
Carlsmith J (2021) Is power-seeking AI an existential risk? Manuscript. https://arxiv.org/abs/2206.13353
Chen L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, Abbeel P, Srinivas A, Mordatch I (2021) Decision transformer: reinforcement learning via sequence modeling. NeurIPS 34:15084–15097
Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D (2017) Deep reinforcement learning from human preferences. NeurIPS 30:4299–4307
Doshi-Velez F, Kortz M, Budish R, Bavitz C, Gershman S, O'Brien D, Scott K, Schieber S, Waldo J, Weinberger D, Weller A, Wood A (2017) Accountability of AI under the law: the role of explanation. Manuscript. https://arxiv.org/abs/1711.01134
Driess D, Xia F, Sajjadi MSM, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu T, Huang W, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Florence P (2023) PaLM-E: an embodied multimodal language model. Manuscript. https://arxiv.org/abs/2303.03378
Glanois C, Weng P, Zimmer M, Li D, Yang T, Hao J, Liu W (2022) A survey on interpretable reinforcement learning. Manuscript. https://arxiv.org/abs/2112.13112
Goldstein S, Kirk-Giannini CD (2023) AI wellbeing. Manuscript. https://philpapers.org/archive/GOLAWE-4.pdf
Hubinger E, van Merwijk C, Mikulik V, Skalse J, Garrabrant S (2021) Risks from learned optimization in advanced machine learning systems. Manuscript. https://arxiv.org/abs/1906.01820
Krakovna V, Uesato J, Mikulik V, Rahtz M, Everitt T, Kumar R, Kenton Z, Leike J, Legg S (2020) Specification gaming: the flip side of AI ingenuity. Blog Post. https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity
Langosco L, Koch J, Sharkey L, Pfau J, Krueger D (2022) Goal misgeneralization in deep reinforcement learning. In: Proceedings of the 39th International Conference on Machine Learning, pp 12004–12019
Metz C (2016) In two moves, AlphaGo and Lee Sedol redefined the future. Wired 16 March, 2016. https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/
Olah C, Cammarata N, Schubert L, Goh G, Petrov M, Carter S (2020) Zoom in: an introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in/
Omohundro S (2008) The basic AI drives. In: Wang P, Goertzel B, Franklin S (eds) Proceedings of the first conference on artificial general intelligence. IOS Press, pp 483–492
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano PF, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback. NeurIPS 35:27730–27744
Park JS, O'Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS (2023) Generative agents: interactive simulacra of human behavior. Manuscript. https://arxiv.org/abs/2304.03442
Perez E, Ringer S, Lukošiūtė K, Nguyen K, Chen E, Heiner S, Pettit C, Olsson C, Kundu S, Kadavath S, Jones A, Chen A, Mann B, Israel B, Seethor B, McKinnon C, Olah C, Yan D, Kaplan J (2022) Discovering language model behaviors with model-written evaluations. Manuscript. https://arxiv.org/abs/2212.09251
Popov I, Heess N, Lillicrap T, Hafner R, Barth-Maron G, Vecerik M, Lampe T, Tassa Y, Erez T, Riedmiller M (2017) Data-efficient deep reinforcement learning for dexterous manipulation. Manuscript. https://arxiv.org/abs/1704.03073
Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron G, Gimenez M, Sulsky Y, Kay J, Springenberg JT, Eccles T, Bruce J, Razavi A, Edwards A, Heess N, Chen Y, Hadsell R, Vinyals O, Bordbar M, de Freitas N (2022) A generalist agent. Manuscript. https://arxiv.org/abs/2205.06175
Rudner TG, Toner H (2021) Key concepts in AI safety: interpretability in machine learning. Center for Security and Emerging Technology Issue Brief
Schroeder T (2004) Three faces of desire. Oxford University Press
Shah R, Varma V, Kumar R, Phuong M, Krakovna V, Uesato J, Kenton Z (2022) Goal misgeneralization: why correct specifications aren't enough for correct goals. Manuscript. https://arxiv.org/abs/2210.01790
Trinh TH, Le QV (2019) Do language models have common sense? Manuscript. https://openreview.net/pdf?id=rkgfWh0qKX
Turpin M, Michael J, Perez E, Bowman SR (2023) Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Manuscript. https://arxiv.org/abs/2305.04388
Wang G, Xie Y, Jiang Y, Mandlekar A, Xiao C, Zhu Y, Fan L, Anandkumar A (2023) Voyager: an open-ended embodied agent with large language models. Manuscript. https://arxiv.org/abs/2305.16291
Funding
This research was funded by The Center for AI Safety.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Goldstein, S., Kirk-Giannini, C.D. Language agents reduce the risk of existential catastrophe. AI & Soc (2023). https://doi.org/10.1007/s00146-023-01748-4