Language agents reduce the risk of existential catastrophe

AI & SOCIETY (Open Forum)

Abstract

Recent advances in natural-language processing have given rise to a new kind of AI architecture: the language agent. By repeatedly calling an LLM to perform a variety of cognitive tasks, language agents are able to function autonomously to pursue goals specified in natural language and stored in a human-readable format. Because of their architecture, language agents exhibit behavior that is predictable according to the laws of folk psychology: they function as though they have desires and beliefs, and then make and update plans to pursue their desires given their beliefs. We argue that the rise of language agents significantly reduces the probability of an existential catastrophe due to loss of control over an AGI. This is because the probability of such an existential catastrophe is proportional to the difficulty of aligning AGI systems, and language agents significantly reduce that difficulty. In particular, language agents help to resolve three important issues related to aligning AIs: reward misspecification, goal misgeneralization, and uninterpretability.
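
To make the architecture described in the abstract concrete, here is a minimal sketch, in Python, of the kind of loop a language agent runs. It is an illustration rather than code from the paper or from any particular project such as Auto-GPT or BabyAGI; the `call_llm` function, the prompts, and the class structure are all assumptions introduced for the example.

```python
# Minimal illustrative sketch of a language agent loop (not code from the paper,
# Auto-GPT, or BabyAGI). call_llm is a hypothetical placeholder for any
# text-in, text-out LLM API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Substitute your preferred LLM API here.")


class LanguageAgent:
    def __init__(self, goal: str):
        self.goal = goal                 # the agent's desire, a natural language sentence
        self.beliefs: list[str] = []     # beliefs, stored as natural language sentences
        self.plan: str = ""              # the current plan, also human-readable

    def observe(self, observation: str) -> None:
        # Record a new observation in the human-readable belief store.
        self.beliefs.append(observation)

    def step(self) -> str:
        # Call the LLM once to revise the plan in light of the goal and beliefs ...
        self.plan = call_llm(
            f"Goal: {self.goal}\nBeliefs: {'; '.join(self.beliefs)}\n"
            "Write a short step-by-step plan for achieving the goal."
        )
        # ... and once more to extract the next concrete action from that plan.
        return call_llm(
            f"Plan: {self.plan}\nState the single next action in one sentence."
        )
```

Because the goal, beliefs, and plan are ordinary sentences, a human overseer can read them directly and edit them if they drift from what was intended; this is the human-readable storage the abstract refers to.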

Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

  1. The phenomenon we call reward misspecification is sometimes also called “reward hacking” (e.g. by Amodei et al. 2016), “specification gaming” (e.g. by Shah et al. 2022), or, in the context of supervised learning, “outer misalignment.”

  2. As we understand it, the problem of goal misgeneralization is similar to the problem of “inner misalignment” (Hubinger et al. 2021).

  3. Hubinger et al. (2021) call this “side-effect alignment.”

  4. See Schroeder (2004) for further discussion of how reward-based learning produces new intrinsic desires for reliable means to one’s goals.

  5. Similar remarks apply to the Decision Transformer architecture developed by Chen et al. (2021).

  6. See Metz (2016).

  7. For more on interpretability in the setting of reinforcement learning, see Glanois et al. (2022).

  8. While we have been careful in this initial exposition to qualify our attributions of mental states like belief and desire to language agents, for the sake of brevity we will omit these qualifications in what follows. It is worth emphasizing, however, that none of our arguments depend on language agents having bona fide mental states as opposed to merely behaving as though they do. That said, we are sympathetic to the idea that language agents may have bona fide beliefs and desires—see our arguments in Goldstein and Kirk-Giannini (2023). Two particularly interesting questions here are whether language agents can respond to reasons and whether, following Schroeder (2004), desires must be systematically related to reward-based learning in ways that language agents cannot imitate.

  9. Some might worry that, because language agents store their beliefs and desires as natural language sentences, their performance will be limited by their inability to reason using partial beliefs (subjective probabilities) and utilities. While we are not aware of work which adapts language agents to reason using partial beliefs and utilities, the same kind of process which is used by Park et al. (2023) to assign numerical importance scores to language agents’ beliefs could in principle be used to assign subjective probabilities to sentences and utilities to outcomes (see the sketch following these notes). We believe this is an interesting avenue for future research. Thanks to an anonymous referee for raising this issue.

  10. Project available at https://github.com/Significant-Gravitas/Auto-GPT.

  11. Project available at https://github.com/yoheinakajima/babyagi.

  12. See Wang et al. (2023).

  13. For more on the commonsense reasoning ability of language models, see Trinh and Le (2019).

  14. See the recent successes of Voyager at completing tasks in Minecraft (Wang et al. 2023).

  15. See Bubeck et al. (2023) for discussion.

  16. The safety of language agents could also be improved by creating multiple instances of the underlying LLM. In this setting, an action would be executed only if (for example) all ten instances recommended the same plan for achieving the goal.

  17. For research in this direction, see Voyager’s skill library in Wang et al. (2023).

  18. Thanks to an anonymous referee for raising these concerns.

  19. Project available at https://segment-anything.com/.

  20. See https://yoshuabengio.org/2023/05/07/ai-scientists-safe-and-useful-ai/ for a recent proposal about how to use AI without developing agents.
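
As a concrete illustration of the suggestion in note 9, the following hypothetical sketch shows how subjective probabilities and utilities might be elicited from an LLM and attached to natural language sentences, in the same spirit as the numerical importance scores of Park et al. (2023). It is not an implementation from that paper; `call_llm`, the prompts, and the scoring scales are assumptions introduced for the example.

```python
# Hypothetical sketch for note 9: eliciting subjective probabilities and utilities
# for natural language sentences from an LLM. call_llm is a placeholder for any
# text-in, text-out LLM API; the prompts and scales are illustrative assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Substitute your preferred LLM API here.")


def credence(sentence: str) -> float:
    # Ask for a probability between 0 and 1 and clamp the parsed answer to that range.
    answer = call_llm(
        "How likely is it that the following is true? Answer with a single number "
        f"between 0 and 1.\nClaim: {sentence}"
    )
    return min(max(float(answer), 0.0), 1.0)


def utility(outcome: str, goal: str) -> float:
    # Ask for a score from -10 (very bad for the goal) to 10 (very good) and clamp it.
    answer = call_llm(
        f"Goal: {goal}\nOutcome: {outcome}\n"
        "Rate how good this outcome is for the goal on a scale from -10 to 10. "
        "Answer with a single number."
    )
    return min(max(float(answer), -10.0), 10.0)


def expected_utility(outcomes: list[str], goal: str) -> float:
    # Standard expected utility: sum over outcomes of credence(outcome) * utility(outcome).
    return sum(credence(o) * utility(o, goal) for o in outcomes)
```

A planning step could then compare candidate plans by the expected utility of their predicted outcomes while keeping every sentence involved open to human inspection.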

References

  • Amodei D, Clark J (2016) Faulty reward functions in the wild. Blog Post. https://blog.openai.com/faulty-reward-functions/

  • Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D (2016) Concrete problems in AI safety. Manuscript. https://arxiv.org/abs/1606.06565

  • Bostrom N (2014) Superintelligence: paths, dangers, strategies. Oxford University Press

  • Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, Lee P, Lee YT, Li Y, Lundberg S, Nori H, Palangi H, Ribeiro MT, Zhang Y (2023) Sparks of artificial general intelligence: early experiments with GPT-4. Manuscript. https://arxiv.org/abs/2303.12712

  • Burns C, Ye H, Klein D, Steinhardt J (2022) Discovering latent knowledge in language models without supervision. Manuscript. https://arxiv.org/abs/2212.03827

  • Cappelen H, Dever J (2021) Making AI intelligible. Oxford University Press

  • Carlsmith J (2021) Is power-seeking AI an existential risk? Manuscript. https://arxiv.org/abs/2206.13353

  • Chen L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, Abbeel P, Srinivas A, Mordatch I (2021) Decision transformer: reinforcement learning via sequence modeling. NeurIPS. 34:15084–15097

  • Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D (2017) Deep reinforcement learning from human preferences. NeurIPS. 30:4299–4307

  • Doshi-Velez F, Kortz M, Budish R, Bavitz C, Gershman S, O'Brien D, Scott K, Schieber S, Waldo J, Weinberger D, Weller A, Wood A (2017) Accountability of AI under the law: the role of explanation. Manuscript. https://arxiv.org/abs/1711.01134

  • Driess D, Xia F, Sajjadi MSM, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu T, Huang W, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Florence P (2023) PaLM-E: an embodied multimodal language model. Manuscript. https://arxiv.org/abs/2303.03378

  • Glanois C, Weng P, Zimmer M, Li D, Yang T, Hao J, Liu W (2022) A survey on interpretable reinforcement learning. Manuscript. https://arxiv.org/abs/2112.13112

  • Goldstein S, Kirk-Giannini CD (2023) AI wellbeing. Manuscript. https://philpapers.org/archive/GOLAWE-4.pdf

  • Hubinger E, van Merwijk C, Mikulik V, Skalse J, Garrabrant S (2021) Risks from learned optimization in advanced machine learning systems. Manuscript. https://arxiv.org/pdf/1906.01820.pdf

  • Krakovna V, Uesato J, Mikulik V, Rahtz M, Everitt T, Kumar R, Kenton Z, Leike J, Legg S (2020) Specification gaming: the flip side of AI ingenuity. Blog Post. https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity

  • Langosco L, Koch J, Sharkey L, Pfau J, Krueger D (2022) Goal misgeneralization in deep reinforcement learning. In: Proceedings of the 39th International Conference on Machine Learning, pp 12004–12019

  • Metz C (2016) In two moves, AlphaGo and Lee Sedol redefined the future. Wired 16 March, 2016. https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/

  • Olah C, Cammarata N, Schubert L, Goh G, Petrov M, Carter S (2020) Zoom in: an introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in/

  • Omohundro S (2008) The basic AI drives. In: Wang P, Goertzel B, Franklin S (eds) Proceedings of the first conference on artificial general intelligence. IOS Press, pp 483–492

  • Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano PF, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback. NeurIPS. 35:27730–27744

  • Park JS, O'Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS (2023) Generative agents: interactive simulacra of human behavior. Manuscript. https://arxiv.org/abs/2304.03442

  • Perez E, Ringer S, Lukošiūtė K, Nguyen K, Chen E, Heiner S, Pettit C, Olsson C, Kundu S, Kadavath S, Jones A, Chen A, Mann B, Israel B, Seethor B, McKinnon C, Olah C, Yan D, Kaplan J (2022) Discovering language model behaviors with model-written evaluations. Manuscript. https://arxiv.org/abs/2212.09251

  • Popov I, Heess N, Lillicrap T, Hafner R, Barth-Maron G, Vecerik M, Lampe T, Tassa Y, Erez T, Riedmiller M (2017) Data-efficient deep reinforcement learning for dexterous manipulation. Manuscript. https://arxiv.org/abs/1704.03073

  • Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron G, Gimenez M, Sulsky Y, Kay J, Springenberg JT, Eccles T, Bruce J, Razavi A, Edwards A, Heess N, Chen Y, Hadsell R, Vinyals O, Bordbar M, and de Freitas N (2022) A generalist agent. Manuscript. https://arxiv.org/abs/2205.06175

  • Rudner TG, Toner H (2021) Key concepts in AI safety: interpretability in machine learning. Center for Security and Emerging Technology Issue Brief

  • Schroeder T (2004) Three faces of desire. Oxford University Press

  • Shah R, Varma V, Kumar R, Phuong M, Krakovna V, Uesato J, Kenton Z (2022) Goal misgeneralization: why correct specifications aren't enough for correct goals. Manuscript. https://arxiv.org/abs/2210.01790

  • Trinh TH, Le QV (2019) Do language models have common sense? Manuscript. https://openreview.net/pdf?id=rkgfWh0qKX

  • Turpin M, Michael J, Perez E, Bowman SR (2023) Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Manuscript. https://arxiv.org/abs/2305.04388

  • Wang G, Xie Y, Jiang Y, Mandlekar A, Xiao C, Zhu Y, Fan L, Anandkumar A (2023) Voyager: an open-ended embodied agent with large language models. Manuscript. https://arxiv.org/abs/2305.16291

Funding

This research was funded by The Center for AI Safety.

Author information

Corresponding author

Correspondence to Cameron Domenico Kirk-Giannini.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Cite this article

Goldstein, S., Kirk-Giannini, C.D. Language agents reduce the risk of existential catastrophe. AI & Soc (2023). https://doi.org/10.1007/s00146-023-01748-4
