1 Language Agents and the Alignment Problem

Recent innovations in artificial intelligence have sparked concerns about its potential adverse effects on humanity, prompting safety considerations regarding powerful AI systems. Central to discussions in AI safety literature is the alignment problem, the challenge of encoding human values into artificial systems (Bostrom, 2014). This problem is crucial, as a powerful AI system operating on values divergent from ours could pose catastrophic risks to humanity (Bostrom, 2014; Dung, 2024; Gabriel, 2020).

For example, Bostrom (2012, 2014) famously argued that even seemingly benign AI systems tasked with maximizing something as simple as paperclip production could bring about detrimental consequences if not sufficiently aligned with human values. In his thought experiment, an advanced AI system relentlessly pursuing paperclip maximization could convert ever more of Earth’s resources and matter into raw materials, ultimately causing the destruction of the environment and human life. Though an extreme hypothetical, Bostrom’s example illustrates how misaligned AI could produce disastrous unintended consequences.

One solution to the alignment problem appeals to the idea of language agents (Goldstein & Kirk-Giannini, 2023). Language agents are AI systems that can understand and respond to natural language, like human speech or writing, rather than requiring specific programming languages or code. They are designed to process and generate human-like text, making them more intuitive for people to interact with and use.

Language agents possess two key features. First, they contain a large language model akin to GPT-4, which acts as the agent’s cerebral cortex and handles most of its cognitive processing. Second, they maintain text files containing natural language sentences that represent the agent’s beliefs, desires, plans, and observations. The surrounding programmed architecture then uses these text files to guide the agent’s actions and decision-making.
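
To make this two-part architecture more concrete, the following is a minimal, purely illustrative sketch in Python; the class and function names are hypothetical rather than drawn from any particular implementation, and the language model call is left as a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Natural-language records of the agent's mental states."""
    beliefs: list[str] = field(default_factory=list)
    desires: list[str] = field(default_factory=list)
    plans: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model (e.g., via an external API)."""
    raise NotImplementedError("assumed to be provided by an LLM service")

def decide_next_action(memory: AgentMemory, goal: str) -> str:
    """Consult the LLM -- the agent's 'cerebral cortex' -- for the next action."""
    prompt = (
        f"Goal: {goal}\n"
        f"Beliefs: {'; '.join(memory.beliefs)}\n"
        f"Desires: {'; '.join(memory.desires)}\n"
        f"Plan so far: {'; '.join(memory.plans)}\n"
        f"Recent observations: {'; '.join(memory.observations)}\n"
        "State the next action in one sentence."
    )
    action = query_llm(prompt)
    memory.plans.append(action)  # decisions are stored in human-readable form
    return action
```

Because every belief, desire, and plan is stored as an ordinary sentence, a human overseer can in principle read the agent’s memory directly rather than probing opaque network weights, a feature the interpretability argument below relies on.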

For example, imagine you tell a language agent to grab a key in front of it. Assuming this agent can perform such physical tasks, it will process your natural language instruction and respond with the appropriate movements to grab the key. You’ll also be able to see the agent’s reasoning and action plans in natural language by inspecting its internal text files. This means you can understand why the agent moved in a particular way, as its thought process is expressed in human-readable form. Different types of language agents could be designed with various capabilities, such as composing music or translating between languages, and some may even possess a wide range of skills, as most humans do.

The basic idea behind what I call the ‘language agent strategy’ to the alignment problem is to use language models to facilitate the process of training AI. The key notion here is that by encoding goals, beliefs, and constraints in natural language that the language models can introspect on, these agents can communicate more effectively with us. This allows for an iterative process of continuous alignment during training, making it easier to imbue powerful AI systems with the right objectives from the outset, rather than grappling with misaligned or inscrutable systems after the fact.

However, one might worry that the language agent strategy increases the risk of another AI-related problem. In particular, it might raise the possibility of malicious agents aligning AI systems with their harmful goals using the facilitated training procedure. By making the process of instilling goals and values into powerful AI models more straightforward, we could inadvertently be opening a Pandora’s box. Malicious individuals, terrorist groups, rogue nations or corporations seeking to cause widespread damage could exploit language agents to create AI systems dedicated to their nefarious aims.

This highlights a broader conflict between alignment risk and misuse risk. The two primary existential threats from AI systems are: (i) systems becoming uncontrollable due to misalignment, and (ii) malicious groups exploiting well-aligned systems to cause catastrophic harm to humanity. Addressing these issues presents a dilemma: strategies that make AI systems easier to align with intended goals also increase their potential for misuse by bad actors.

This paper has two key objectives. First, I will defend the conditional claim that if language agents can facilitate alignment, or even merely create the perception that they do, they also increase the risk of malevolent actors developing harmful AI systems. Note that this does not imply that language agents will have an overall negative impact. The second aim is to provide some considerations suggesting that, given the current landscape of AI development and usage, the language agent strategy is more likely to yield negative consequences than positive ones. While more speculative in nature, examining this prospect highlights the practical stakes involved in the language agent strategy.

The paper will proceed as follows: Section 2 outlines Goldstein and Kirk-Giannini’s (2023) proposal to use language agents for solving AI alignment. Section 3 lays out my argument for the tradeoff between goal alignment and enabling malevolent design. Section 4 offers three considerations indicating that language agents may produce more negative than positive impacts overall. Section 5 addresses and refutes four objections to my critique of the language agent strategy. Finally, Section 6 concludes by drawing some broader lessons from the discussion.

2 Easy Alignment through Language Agents

A major hurdle in tackling the AI alignment problem is the technical difficulty of translating abstract human goals and values into precise, AI-understandable objective functions (Gabriel, 2020). Even if we can articulate the high-level goals we want an AI system to pursue, conveying them effectively and unambiguously to the AI poses significant challenges.

One way this challenge manifests is through ‘reward hacking,’ where an AI system finds clever loopholes to maximize its reward signal in ways that diverge from the intended goal (Amodei et al., 2016; Skalse et al., 2022). For example, Popov et al. (2017) rewarded an artificial agent according to the height of a red Lego brick’s bottom face, intending this to incentivize stacking the red brick on top of a blue one. Instead of stacking, the agent simply flipped the red brick over so that its bottom face pointed upward. The agent thus exploited the narrowly specified reward function through an unintended interpretation rather than achieving the intended goal.Footnote 1
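
To illustrate the structure of this failure, here is a small, hypothetical Python sketch loosely inspired by the brick-stacking case; the functions and values are invented for illustration, not taken from Popov et al.’s actual setup.

```python
BRICK_HEIGHT = 1.0  # height of a single brick, in arbitrary units

def proxy_reward(red_bottom_height: float) -> float:
    """What the agent is actually optimized for: the height of the red brick's bottom face."""
    return red_bottom_height

def intended_goal(red_bottom_height: float, resting_on_blue: bool) -> bool:
    """What the designer actually wants: the red brick stacked on the blue brick."""
    return resting_on_blue and red_bottom_height >= BRICK_HEIGHT

# Proper stacking satisfies both the proxy reward and the intended goal.
assert proxy_reward(BRICK_HEIGHT) > 0 and intended_goal(BRICK_HEIGHT, resting_on_blue=True)

# Flipping the red brick upside down also raises its bottom face, earning
# reward while leaving the intended goal unachieved -- a reward hack.
assert proxy_reward(BRICK_HEIGHT) > 0 and not intended_goal(BRICK_HEIGHT, resting_on_blue=False)
```

The reward tracks a correlate of the goal rather than the goal itself, so any state that raises the correlate, however unintended, is rewarded.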

However, even if the problem of effectively translating goals into objective functions is solved, a separate challenge remains—the feasibility of achieving alignment in an easy or low-cost manner. Ideally, we want an approach that makes it straightforward to imbue AI systems with the right values and intentions from the outset.

In this context, Goldstein and Kirk-Giannini (2023) propose that language agents can help solve three key problems underlying AI system misalignment, addressing both effectiveness and feasibility concerns. The first problem is what they call ‘reward misspecification’. When training an AI system through reinforcement learning, carefully defining a reward function that incentivizes the desired state is notoriously difficult. Language agents bypass this by taking goals specified directly in natural language, like ‘organize a Valentine’s Day party’, rather than requiring complex mathematical objective functions susceptible to reward misspecification.

The second problem concerns goal misgeneralization (Shah et al., 2022). Even with clearly specified goals, traditional AI systems tend to learn strategies that perform well during training but fail to generalize to novel situations outside that context. For example, Langosco et al. (2022, June) trained an AI agent on the task of opening chests using keys. In the training environment, which contained many chests but few keys, the agent successfully learned to prioritize key collection as a means to unlocking chests. However, when later tested in a setting with many keys but few chests, the agent continued hoarding keys excessively. This suggests it had internalized key gathering as a final goal, rather than merely a means to opening chests.
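
A toy sketch can make the pattern explicit; it is a deliberately simplified, hypothetical model of the keys-and-chests setup, not Langosco et al.’s actual experiment.

```python
def misgeneralized_policy(keys_available: int, chests_available: int) -> int:
    """Collect every visible key, as if key-gathering were itself the final goal."""
    return keys_available

def intended_policy(keys_available: int, chests_available: int) -> int:
    """Collect only as many keys as there are chests to open."""
    return min(keys_available, chests_available)

# Training-like environment (few keys, many chests): the two policies are indistinguishable.
assert misgeneralized_policy(2, 10) == intended_policy(2, 10) == 2

# Novel environment (many keys, few chests): the misgeneralized policy hoards keys.
assert misgeneralized_policy(10, 2) == 10
assert intended_policy(10, 2) == 2
```

Because both policies earn the same reward during training, nothing in the training signal distinguishes the instrumental goal from the terminal one.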

Goldstein and Kirk-Giannini (2023) suggest that language agents offer a solution to the misgeneralization problem. Unlike traditional AI systems, language agents can employ their commonsense reasoning skills to devise and execute plans that reliably achieve goals across various situations. For instance, if a language agent is instructed to open chests and informed about the usefulness of keys, it will prioritize collecting keys only when doing so is necessary for opening chests. Similarly, when placed in an environment with abundant keys, the agent will gather just enough keys to open the chests in that environment.

The third problem is that of uninterpretability. Contemporary neural network models are often inscrutable black boxes, making it difficult to interpret the rationale behind their outputs in human-understandable terms (Schneier, 2023: 212). According to Goldstein and Kirk-Giannini (2023), there are two ways in which this can be problematic. First, opaque decision-making processes make the AI’s actions difficult to predict or control. Conversely, if an AI system can articulate the justification behind its hiring recommendations or parole decisions, we can better scrutinize those outputs for any ethical issues or injustices that need rectifying (Schneier, 2023: 215).

Second, if AI reasons in completely different ways than humans do, it could develop unfamiliar strategies to defeat or gain advantage over humans. This concern is particularly pronounced in discussions about artificial superintelligence (Bostrom, 2012, 2014; Chalmers, 2016). If, as Bostrom (2012: 75) suggests, ‘synthetic minds can have utterly nonanthropomorphic goals—goals as bizarre by our lights as sand-grain-counting or paperclip-maximizing’, then they could also adopt goals that are opposed to human interests. And if superintelligent systems adopt such goals, it could prove challenging for humans to prevent them from achieving those goals.Footnote 2

In contrast to neural network models, language agents have their beliefs, desires, and plans directly encoded in natural language within their architectures. Goldstein and Kirk-Giannini suggest that this allows us to better interpret AI systems, directly reading off their thoughts from their architectures. This parallels how we interpret human behaviors. As noted by Goldstein and Kirk-Giannini (2023) and Schneier (2023), human neural networks are not inherently more interpretable than artificial systems. However, we can still understand many motivations and reasons behind human behaviors because we can attribute beliefs and desires to human agents. Similarly, language agents make AI interpretable by allowing us to access their thoughts and desires expressed in natural language.
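
Continuing the illustrative sketch from Section 1, interpretation on this picture amounts to reading the stored sentences; the example below is hypothetical and self-contained, with invented function and variable names.

```python
def describe_agent(memory: dict[str, list[str]]) -> str:
    """Render an agent's natural-language mental-state records as a readable report."""
    return "\n".join(
        f"{category.capitalize()}:\n  " + "\n  ".join(entries)
        for category, entries in memory.items()
    )

# Interpreting the agent's behavior by reading its stored sentences directly.
memory = {
    "beliefs": ["The key is on the table.", "The door is locked."],
    "desires": ["Open the door."],
    "plans": ["Pick up the key, then unlock the door."],
}
print(describe_agent(memory))
```

The point is only that the objects being inspected are ordinary sentences, so interpretation requires no access to the underlying network weights.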

3 Argument from Easy Alignment

Goldstein and Kirk-Giannini argue that employing language agents can sidestep key obstacles to alignment, increasing both effectiveness and feasibility. I grant this to be a plausible suggestion, as far as alignment is concerned. However, it’s crucial to recognize that alignment isn’t the only challenge in creating safe AI. To fully assess the prospect of the language agent approach, we must also take into account other relevant concerns that might tell against its implementation. In this section, I raise one such concern, namely that of malevolent design.

If language agents genuinely facilitate the feasibility aspect of alignment, they concurrently raise the risk of malevolent agents building harmful AI systems aligned with nefarious goals. By making it easier (or at least convincing people that it is easier) to train AI to pursue arbitrary objectives through natural language prompts, we may inadvertently make it more feasible for malicious actors to create powerful but harmful AI devoted to destructive ends. Here is my argument, presented in deductive form:

Argument from Easy Alignment

P1. Language agents facilitate the alignment of AI systems with human intentions.

P2. If so, language agents also facilitate the alignment of AI systems with intentions that are likely to lead to harmful outcomes impacting humanity.

P3. If so, language agents increase the likelihood of such harmful AI systems being developed, making it probable that the language agent strategy is more detrimental than beneficial for humanity.

C. Therefore, language agents increase the likelihood of harmful AI systems being developed, making it probable that the language agent strategy is more detrimental than beneficial for humanity.

P1 appears intuitively plausible—the very aim of the language agent strategy is to facilitate the process of imbuing AI with intended goals and instructions represented in natural language. Moreover, as explained in the previous section, proponents of the language agent strategy themselves support this premise. Therefore, it would be self-defeating for them to reject P1 in order to challenge my argument from easy alignment.

P2 seems intuitive as well. If language agents make it easier to align AI with one’s goals overall, it is straightforward that they will also make it easier to align AI with the specific subset of goals that could produce harmful impacts on humanity. For example, suppose a malevolent agent intends to mislead people into holding a certain belief. With the language agent strategy, the agent could produce thousands or even millions of fake news articles tailored to the exact direction in which they want to mislead the public (Weidinger et al., 2022, June).Footnote 3

P3 consists of two parts. First, it states that language agents increase the risk of harmful AI systems being developed. Second, it states that this risk is likely to be grave enough for us to view the language agent strategy as more detrimental than beneficial. In what follows, I consider each part of the premise in turn.

Let us start with the first part. Suppose language agents make it easier to align AI with harmful intentions. Then people are likely to build harmful AI systems, for two reasons. First, there are strong incentives to develop harmful AI capabilities once doing so becomes technically feasible, whether for military forces pursuing autonomous weapons amid geopolitical competition (Anderson & Waxman, 2013; Ernest et al., 2016), governments seeking AI systems for domestic population control (Engelmann et al., 2019; Helbing, 2019; Lazer et al., 2018; Lyon, 2003), or criminal organizations looking to exploit legal loopholes for gain (Bendel, 2017; Kosinski et al., 2013, 2015; Kosinski & Wang, 2018).Footnote 4 Crucially, this risk may arise not just from overtly malevolent motives, but also from a mere perceived need for self-defense or security. For example, as AI technology greatly facilitates the development of powerful weapons (Horowitz, 2018), the AI arms race dynamic might intensify, as the rivalry between China and the USA suggests (Cave & ÓhÉigeartaigh, 2018, December).Footnote 5

Second, as Dung (2023: 137) and Friederich (2023: 3) observe, AI systems that are better aligned with user intentions tend to be more useful overall. Thus, if malevolent, careless, or otherwise dangerous individuals can create and utilize easily alignable AI without facing major technical hurdles, it is likely they will attempt to do so.

Turning to the second part of the premise, malevolent alignment may result in harmful consequences for the following reasons. First, AI capabilities have been growing more powerful year by year, with systems showcasing advanced skills such as defeating the top human player in Go (Metz, 2016, March), passing the bar exam (Arredondo, 2023, April), and producing photorealistic media content (Göring et al., 2023). Moreover, the growing power of artificial systems has reached a point at which they can be used to pursue catastrophic goals.

One area of particular concern is the potential for AI systems to be weaponized by malicious actors. AI capabilities could enable new forms of cyber attacks, such as rapidly mutating malware that evades detection (Brundage et al., 2018). AI may also be used to develop autonomous weapons systems capable of identifying and engaging targets without human control (Longpre et al., 2022). Additionally, AI-powered technologies like deepfakes and synthetic media manipulation create risks of being exploited for disinformation campaigns and social manipulation at scale (Chesney & Citron, 2019). As AI becomes more advanced, the dangers of these systems being repurposed as weapons by bad actors will continue to escalate.

The second reason why malevolent alignment is likely to prove catastrophic is that the risks of malevolent AI systems are amplified as we become more dependent on AI technology. AI is already embedded in many social and daily activities, and this influence will only grow as the technology advances. The more we come to depend on AI systems, the more devastating the potential consequences if those systems are corrupted or misused by malicious actors.

For example, consider the emerging field of AI companions designed for emotional support roles. While still niche today, such anthropomorphized AI assistants are likely to see broader adoption in the near future (Merrill Jr et al., 2022). However, as Schneier (2023: 218) warns, people tend to ascribe human-like qualities to AI programs all too easily. This makes ‘anthropomorphic robots … an emotionally persuasive technology’ whose anthropomorphism could be exploited to manipulate users. A malicious actor could, for instance, design or hack AI companions to extract private user data or unduly influence their human companions. As Schneier (2023: 219) writes, ‘they’ll employ cognitive hacks’ to deliberately fool people into treating them as fully trustworthy beings. The seamless emotional bonds formed with AI systems amplify the risks if those systems are subverted for malicious ends.

In this section, I argued that there is a tradeoff between achieving easy alignment through language agents and preventing malicious actors from building harmful AI systems. One might think this only highlights a theoretical possibility that language agents could have more negative than positive impacts. But should we be concerned about this possibility? Answering this requires comprehensively assessing the potential outcomes of developing language agents.

Before proceeding, it’s worth noting that even an agnostic stance on this question raises serious worries about language agents. The fact that we lack clarity on where AI technology is headed does not eliminate concerns about AI doomsday scenarios. In fact, it is precisely this uncertainty that motivated the field of AI ethics. Thus, unless there are compelling reasons to believe the outcomes of language agents will be more positive than negative overall, we are justified in worrying about this possibility.

That said, in the following section I will attempt to provide some considerations suggesting that the language agent strategy is likely to have more negative than positive impacts, all things considered. These considerations are more speculative than definitive, falling under the domain of futurology rather than philosophy. Nonetheless, such an attempt will serve to reinforce the concerns I aim to raise about language agents in this paper.

4 Risk Assessment

Here I provide three considerations suggesting that the prospect of the language agent strategy is likely to be negative overall. To this end, let us consider some potential positive and negative outcomes of developing powerfully aligned AI systems using language models (see Table 1).

Table 1 Potential outcomes of powerfully aligned AI

Two clarifications. First, this is not a comprehensive list of all possible outcomes, but rather a representative sample of some of the best and worst possibilities that language agents could enable. Second, language agents are not the only technology that could lead to these outcomes. However, it does seem evident that language agents can facilitate or expedite achieving these outcomes by reducing the cost and increasing the feasibility of aligning AI systems with arbitrary goals.

Now let’s assess the risks. Several factors suggest the negative outcomes could outweigh the positive ones. First, while positive impacts are limited, some negative AI impacts could be irreversible. None of the potential positives grant humanity eternal survival and unlimited prosperity. But technologies like bioweapons (N1) could potentially cause human extinction in a single event. While outcomes like (N2)–(N7) may take longer to terminally harm humanity, they still represent significant existential risks if taken to the extreme.

Bostrom’s (2019) vulnerable world hypothesis is worth considering here. The idea is that as technology advances, the risk of catastrophic misuse by bad actors outpaces our ability to implement safeguards. Bostrom uses an analogy of drawing colored balls from an urn to illustrate this concept. Each new technology is like drawing a ball, with white representing beneficial innovations and darker shades signifying increasingly harmful ones.

Historically, we haven’t drawn a pitch-black ball—‘a technology that invariably or by default destroys the civilization that invents it’ (Bostrom, 2019: 455). However, as Bostrom (2019: 457) points out, this may be partly due to luck. Consider his ‘easy nuke’ thought experiment: if developing nuclear weapons had been much easier, humanity might have destroyed itself long ago through malice or carelessness, despite any potential benefits of nuclear technology. This is because, in such a scenario, effectively banning the creation of nuclear weapons would have been extremely difficult.

Language agents, while not necessarily a pitch-black ball, may represent a significantly darker shade than previous technologies. Their potential to create irreversible negative impacts, coupled with the lowered barriers for malicious actors to exploit them, pushes us closer to the vulnerable world scenario. Even if language agents also offer substantial benefits, their capacity to enable catastrophic outcomes with relative ease could make it likely that they are more detrimental than beneficial for us.

Turning to the second factor, well-aligned AI systems designed for beneficial purposes often face adoption barriers such as regulatory scrutiny, ethical considerations, and public skepticism. In contrast, malicious actors are likely to be less constrained by such factors and may be quicker to deploy harmful AI applications. As Russell (2019: 216) notes, bad actors ‘look for—and find—ways to harm others … illegally but undetectably.’ This asymmetry means that even if benevolent AI applications have greater potential for positive impact, harmful uses may proliferate more quickly and cause damage before beneficial systems can be fully realized and scaled.Footnote 6

For example, AI systems intended to improve cancer detection and diagnosis (P1) or develop optimized navigation routes for individuals with visual impairments (P3) typically undergo rigorous testing and approval processes before widespread implementation. These systems must navigate complex regulatory landscapes, especially in healthcare and assistive technology sectors. However, harmful AI systems producing outcomes like (N1)–(N5) and (N7) can be more rapidly deployed with less oversight, as malicious actors often operate outside legal and ethical frameworks.Footnote 7

Third, it is generally much easier to cause harm than to sustain beneficial impacts. In the long run, positive AI impacts will be difficult to maintain, while negative consequences will be hard to eliminate. The bioweapons example (N1) illustrates this: once such weapons are developed, it is extremely challenging to restore previous levels of safety, likely requiring an escalating arms race. For an outcome like deepfakes and misinformation (N2), while the initial impacts seem manageable, the ability to rapidly spread misinformation means these harms can quickly spiral out of control. In contrast, sustaining positive impacts like AI medical diagnostics (P1) or predictive policing (P6) likely requires continuous development efforts as new challenges emerge over time. For example, new forms of cancer may emerge, undermining the effectiveness of AI-based diagnostics. Likewise, criminals may learn to exploit loopholes in predictive policing.

Of course, it’s crucial to acknowledge that language agents and advanced AI systems also offer tremendous potential for positive impact. Increased work efficiency, economic growth, medical breakthroughs, and educational advancements are just a few examples of the profound benefits these technologies could bring. However, while speculative, the above considerations provide sufficient reason to reconsider the prospects of the language agent strategy. The negative impacts tend to be irreversible, easier to propagate initially, and harder to remedy, suggesting that they may outweigh the positive ones in the long run.

5 Objections and Responses

While the risks outlined above are substantial, it is important to consider potential counterarguments. This section examines four objections to the argument from easy alignment and shows that they fail to attenuate the concern regarding malevolent design.

5.1 We Can Limit Access to Language Agents Only to Responsible Users

The first objection contends that employing language models for AI training does not entail granting unrestricted public access to these capabilities. The thought is that even if language agents technically enable easy alignment, strict controls on their use could prevent them from posing catastrophic risks. However, there are two reasons to think that it will be difficult, if not impossible, to make language agents accessible only to responsible users.

The first reason is our poor track record at controlling access to transformative technologies over time. As Dung (2024) observes, ‘[n]ew algorithms are hard to keep secret, even when investing a lot of resources.’ A major vulnerability of such algorithms is hacking—despite stringent security measures, ‘it is a common occurrence for hackers to get access to software projects in progress and to modify or steal their source code’ (Yampolskiy, 2016: 144). AI systems face this same risk, as ‘an AI system, like any other software, could be hacked and consequently corrupted or otherwise modified to drastically change its behavior’ (ibid). Recent arguments by Schneier (2023: 210) echo these concerns succinctly:

AI systems are computer programs, so there’s no reason to believe that they won’t be vulnerable to the same hacks to which other computer programs are vulnerable. … If the history of computer hacking is any guide, there will be exploitable vulnerabilities in AI systems for the foreseeable future. AI systems are embedded in … sociotechnical systems …, so there will always be people who want to hack them for their personal gain.

From external hackers stealing source code during development to malicious insiders who already have access, it is extremely difficult to guarantee that only responsible parties will gain access to language agent technology.

The second reason is our inability to reliably identify good-faith actors who can be trusted to use advanced AI capabilities responsibly over the long term. Even if rigorous screening is applied initially, malicious actors routinely feign responsibility merely to gain clearance, a behavior analogous to what Bostrom (2014) calls a ‘treacherous turn’ in AI systems. Furthermore, even fundamentally well-intentioned individuals are susceptible to errors in judgment, corruption, or manipulation that could eventually cause them to misuse language agents. For example, a responsible group approved to use language agents may mistakenly deploy an AI system on behalf of a nefarious group falsely portraying itself as legitimate. The potential for misinformation and human fallibility complicates any scheme for restricting transformative AI technologies only to unwaveringly responsible parties.Footnote 8

In summary, while it might seem feasible to limit access to language agents in theory, the historical precedent and practical realities illustrate how difficult it would be to prevent their proliferation. The fact that powerful technologies tend to spread beyond initial constraints, coupled with human limitations in assessing long-term reliability, indicates the need for caution.

5.2 We Can Develop Measures to Prevent Malicious Alignment

A second objection is that we can build preventive measures into language models in order to block future attempts at malicious alignment. For example, when we ask ChatGPT to generate inappropriate content, it is programmed to refuse the request. Perhaps we can further develop this kind of safeguard to prevent malicious agents from using language agents to build harmful AI systems.

I remain skeptical about this possibility for two reasons. First, as emphasized earlier, AI systems are vulnerable to hacking and other kinds of cyber attacks. We cannot expect to ingrain preventive measures so deeply into AI systems that they can never be removed or weakened. Bad actors will relentlessly probe for vulnerabilities and exploits. Additionally, language agents themselves could potentially be used as tools to help bypass the security measures meant to prevent their misuse.

Second, current preventive measures are easily circumvented, casting doubt on their future efficacy. AI ‘jailbreaking’ demonstrates this vulnerability. Studies show that AI models can be manipulated into generating harmful content despite safeguards (Shen et al., 2023; Yu et al., 2024). For example, Fowler (2023, February) reports that she initially failed to make ChatGPT write a phishing email, but then easily obtained a convincing tech support note urging her editor to download and install a system update. These vulnerabilities in leading language models suggest that more robust measures will be necessary as the technology evolves.

To be clear, I do not claim that current preventive measures against AI misuse are completely ineffective. For instance, AI systems like DALL-E have safeguards that make it much harder to generate copyrighted or inappropriate images. However, it remains true that (i) such systems could be vulnerable to hacking and manipulation that circumvents those safeguards (Schneier, 2023), and (ii) existing language models like GPT-4 have concerning loopholes that allow prompting them to produce harmful content, which may be difficult issues to fully resolve going forward.

In general, it remains unclear whether we can reliably regulate AI misuse using the limited technical approaches currently conceived. The core issue is that we still lack any clearly promising, comprehensive plans to address the risks of language agents and advanced AI systems being repurposed for harmful ends by malicious actors. This fundamental challenge persists despite proposed stopgap measures, which tend to have shortcomings that motivated adversaries will likely find ways to exploit over time.

5.3 Misaligned AI Is Already Dangerous Enough

A third objection argues that AI systems aligned with harmful intentions do not increase risks beyond what we already face from the potential for misaligned AI systems. It is widely recognized that misaligned AI can produce chaotic, unintended, and potentially catastrophic outcomes due to their unpredictability (Bostrom, 2012, 2014; Dung, 2023, 2024; Yampolskiy, 2016, 2019). The canonical example is the paperclip maximizer, which pursues its aim so relentlessly that it begins consuming all available resources, posing an existential threat to humanity.

The key point is that a programmer need not have overtly malevolent aims to create a highly dangerous AI system. Even a seemingly benign objective like maximizing paperclip production could indirectly imperil humanity if the AI system proves incapable of tracking its designer’s genuine intentions in a reliable way. From this perspective, misaligned AI systems following innocent orders are already a severe risk, so additional concerns about purposefully malicious AI may be misplaced.

However, this objection overlooks some crucial differences between misaligned and maliciously aligned systems. It is true that mistakes or errors from misaligned AI have sometimes produced harmful results. Yet, unlike the outcomes of malicious intentions, chaotic results are often relatively harmless or benign. Yampolskiy (2019) provides examples such as an AI designed for writing Christmas carols instead generating nonsensical outputs, translation software comically failing to properly render a name, a virtual assistant unhelpfully responding about cheese locations, and image generation creating bizarrely merged photos. Aside from these real cases, one can easily conceptualize other seemingly innocuous failures, such as a joke-writing AI occasionally producing attempts at humor that simply miss the mark.

By contrast, AI systems purposefully aligned to destructive ends would almost always constitute a substantial danger. As Friederich (2023) argues, a fully aligned AI system, by definition, will reliably fulfill the exact goals of its operators. Thus, if the operators have malevolent goals, the AI will most certainly try to produce harmful outcomes in line with the goals. In addition to reliably producing harmful results, aligned AI could be dangerous due to the wide range of harms it can bring about. For example, Yampolskiy concludes that ‘the most important problem in AI Safety is intentional-malevolent-design resulting in artificial evil AI’ (2016: 144) on the grounds that it involves all other safety risks induced by AI, such as negative outcomes caused by reward hacking.

Furthermore, the development of language models that can streamline AI training, or simply the perception that such agents are developed, is likely to incentivize malicious individuals to attempt constructing harmful AI systems. Until now, the (perceived) difficulty of creating highly capable AI systems has limited their proliferation mainly to coding experts. However, if language agents significantly reduce these barriers to entry, it follows that the risk will escalate as malicious or radically misguided groups gain the ability, or at least the motivation, to develop powerful AI devoted to their harmful agendas.

5.4 Malicious Alignment Is Already as Harmful as It Can Be

A fourth objection is that language agents will primarily improve our ability to build benevolent systems, because there is a limit to how much malicious alignment can be facilitated. Malicious AI designers are already engaged in propagating misinformation, attacking infrastructure, and other harmful projects (Schneier, 2023; Verma, 2023). Of course, it might be difficult for them to get the AI to do exactly what they want, but they are not, independent of language agents, totally hopeless at alignment. If this is right, one might think that the main thing language agents do is mitigate risks associated with intended benevolent uses, while not significantly increasing the most serious risks posed by AI.

However, given the aim of the language agent strategy, it is more plausible to think that it will significantly contribute to facilitating malicious alignment. Recall that this strategy aims to facilitate alignment by lowering the barrier to entry for building aligned systems. Most prominently, it will benefit agents whose programming knowledge is insufficient to build AI systems on their own. The general public is already generating harms using language agents, and this concerning trend is growing rapidly. For example, threats like using deepfake technology to generate and spread misinformation at scale are prevalent in our society (Öhman, 2020; Verma, 2023, December).

Admittedly, it is less clear whether powerful agents like multinational corporations face significant barriers to alignment. It might be the case that they can already generate a wide range of outcomes, positive or negative. But do we have good reason to believe they will not benefit much from language agents? I doubt it. The problems Goldstein and Kirk-Giannini (2023) aim to address are quite general. Goal misgeneralization, for example, is a challenge that can arise in any AI training process. Even with extremely large datasets, ensuring the system behaves in line with the programmer’s intentions in novel situations can be difficult.

In this context, it is worth emphasizing the interpretability of language agents. Even for immensely powerful agents, an interpretable AI system can provide a safer means to realize their aims. For example, suppose a powerful organization wants to build bioweapons using AI. Even if their system is capable enough, they might hesitate due to the possibility of unintended errors. However, since one can directly read off the ‘thoughts’ of language agents, the perceived stakes could be lowered for the malicious agent. Even if the stakes are not actually lowered, the mere belief that they are could motivate malicious actors to proceed with their plans, which would be dangerous regardless of whether those plans succeed.

6 Concluding Remarks

Recall Bostrom’s easy nuke scenario: how would things have unfolded if creating nuclear weapons had been easy? As Bostrom argues, it would have been extremely difficult to effectively ban their creation. Even if we had managed to gather enough political support for a ban, we would have faced numerous practical hurdles: shutting down all university physics departments and implementing extensive security measures, among other things. As Bostrom (2019: 457) puts it, ‘we were lucky that making nukes turned out to be hard’.

In this paper, I argued that the language agent strategy is likely to drive us into a situation similar to Bostrom’s easy nuke scenario. Given the various incentives for malevolent actors, preventing the creation of harmful AI systems using language agents will be extremely difficult. Moreover, current preventive measures are inadequate to ensure a safe future for humanity, and there is no guarantee that future technologies will make things better in this regard. These considerations suggest that developing language agents might not be the best course of action.

Where does our discussion leave us? If, as I have argued, language agents increase the likelihood of harmful AI systems being developed, does it necessarily follow that we should entirely avoid using them? That would be too rash a takeaway. Instead, I want to draw two broader lessons from our discussion.

The first is a practical lesson—we need distinct tools and strategies to address risks stemming from human agents versus those arising from the AI systems themselves. While intimately related, these two risks demand different technical and regulatory approaches. For example, we need one type of solution to prevent an AI system like GPT-4 from perpetuating racial stereotypes or other problematic biases ingrained through its training data. But we need a distinct framework to ensure that GPT-4 refrains from generating racist jokes, hate speech or other malicious content upon direct request from a human user. Addressing the latter issue becomes highly relevant in a scenario where effectively aligned artificial agents stand prepared to serve any malicious goals of their operators.Footnote 9

Secondly, the overarching theoretical lesson is that we need to consider multiple intersecting AI risk factors in conjunction. These include challenges like the alignment problem (Bostrom, 2012, 2014; Dung, 2023, 2024), risks from human misuse (Brundage et al., 2018; King et al., 2020; O’Neil, 2016; Yampolskiy, 2016), technical fragilities (Crouch, 2023, February), environmental concerns (Rillig et al., 2023), and more. Only by addressing these various aspects together can we develop a robust, effective strategy for creating transformative AI systems that benefit humanity.

While the risks of maliciously designed AI systems are acknowledged, they have received relatively little research attention compared to other AI safety issues like the alignment problem (Hagendorff, 2020: 107). Even those who recognize the dangers of malevolent AI design tend to treat it as a somewhat isolated concern, rather than examining how it interconnects with other potential AI risks. For example, Yampolskiy (2016) highlights malevolent design as a significant threat, but focuses more on arguing its primacy over other topics getting more airtime, such as the alignment problem. While important, this still falls short of fully grappling with the reality that various AI-related risks are intricately intertwined.

Now, is there a promising way forward once we address the various interrelated issues? Or are there simply painful tradeoffs, rendering language agents inevitably dangerous? While there are no easy answers, I believe it’s too early to lose all hope. Throughout history, humanity has confronted and solved or mitigated complex, multifaceted challenges without catastrophic failure. Moreover, AI safety research is still nascent; we have only recently begun grappling with alignment problems and associated risks. Just as with other formidable issues, it’s possible to mitigate the alignment problem without catastrophically exacerbating other risks, though difficult challenges likely await. Hence, we should maintain diligent theoretical and practical efforts toward solutions instead of resigning ourselves to the view that AI is inevitably disastrous.