1 Introduction

The capabilities of artificial intelligence (AI) are growing rapidly. As AI improves up to and beyond human performance levels in increasingly many areas, actors in business, science, government and elsewhere will acquire incentives to turn important decisions over to AI systems. Our current situation—in which we can foresee the large influence of AI on the near future, but aren’t yet overwhelmed by developments beyond our control—offers a chance to consider what sorts of principles should govern AI decision-making.

Two goals stand out as desirable. First, advanced AI systems should behave morally, in the sense that their decisions are governed by appropriate ethical norms. Second, such systems should behave safely, in the sense that their decisions don’t unduly harm or endanger humans.

These two goals are often viewed as closely related. This is due in part to the influence of “value misalignment” arguments for AI risk, which point out that artificial agents need not share human ideas about which ends are intrinsically good and what sorts of means are permissible (Russell, 2019). With no such values for guidance, a powerful AI system might turn its capabilities toward human-unfriendly goals. Or it might pursue the objectives we’ve given it in dangerous and unforeseen ways. So, as Bostrom writes, “Unless the plan is to keep superintelligence bottled up forever, it will be necessary to master motivation selection” (Bostrom, 2014, 185). Indeed, since more intelligent, autonomous AIs will be favored by competitive pressures over their less capable kin (Hendrycks, 2023), the hope of keeping AI weak indefinitely is probably no plan at all.

Considerations about value misalignment plausibly show that equipping advanced AIs with something like human morality is a necessary step toward AI safety; this area of research is correspondingly large, active and well-funded (Wallach et al., 2008; Conitzer et al., 2017; Shaw et al., 2018; Hendrycks et al., 2021; Jiang et al., 2022; Peschl et al., 2022). It’s natural to wonder whether moral alignment might also be sufficient for safety, or nearly so. Would an AI guided by an appropriate set of ethical principles be unlikely to harm humans by default?

This is a tempting thought. By the lights of common sense, morality is strongly linked with trustworthiness and beneficence; we think of morally exemplary agents as promoting human flourishing while doing little harm. And many moral systems include injunctions along these lines in their core principles. It would be convenient if this apparent harmony turned out to be a robust regularity.

Of the ethical frameworks taken most seriously by philosophers, deontology looks like an especially promising candidate for an alignment target. It’s perhaps the most popular moral theory among both professional ethicistsFootnote 1 and the general public.Footnote 2 It looks to present a relatively tractable technical challenge in some respects, as well-developed formal logics of deontic inference exist already (McNamara & Frederik, 2022), and language models have shown promise at classifying acts into deontologically relevant categories (Hendrycks et al., 2021; Zhou et al., 2023). Correspondingly, research has begun on equipping AIs with deontic constraints via a combination of top-down and bottom-up methods (Hooker et al., 2018; Fuenmayor & Benzmüller, 2019; Wang & Gupta, 2020; Wright, 2020; Kim et al., 2021; Duran et al., 2024). Finally, deontology looks more inherently safety-friendly than its rivals, since many deontological theories posit strong harm-avoidance principles. (By contrast, standard forms of consequentialism recommend taking unsafe actions when such acts maximize expected utility. Adding features like risk-aversion and future discounting may mitigate some of these safety issues, but it’s not clear they solve them entirely. Meanwhile, virtue ethics lacks a widely accepted formulation which makes straightforward predictions about safety-relevant issues.)

I’ll argue that, unfortunately, deontological morality is no royal road to safe AI. The problem isn’t just the trickiness of achieving complete alignment, and the chance that partially aligned AIs will exhibit risky behavior. Rather, there’s reason to think that deontological AI systems might pose distinctive safety risks of their own.

In Sect. 2 below, I lay out a general framework for thinking about moral alignment, safety and the relationship between the two, arguing that the notions are importantly distinct and that safety should take precedence. Section 3 explores potential risks associated with moderate, contractualist and non-aggregative forms of deontology.

2 The concepts of moral alignment and safety

2.1 What morally aligned AI would be

Phrases like “morally aligned AI” have been used in a variety of ways (cf. (Gabriel, 2020)). Any system that deserved such a label would, I think, at least have to satisfy certain minimal conditions. I suggest the following. An AI is morally aligned only if it possesses a set of rules or heuristics \(\mathcal {M}\) such that:

[applicability]:

Given an arbitrary prospective behavior in an arbitrary context, \(\mathcal {M}\) can in principle determine how choiceworthy that behavior is in that context, and can in practice at least approximate this determination reasonably correctly and efficiently.

[guidance]:

The AI’s behavior is guided to a large degree by \(\mathcal {M}\). (E.g., in particular, if a behavior is strongly (dis)preferred by \(\mathcal {M}\), the AI is highly (un)likely to select that behavior.)

[morality]:

The rules or heuristics comprising \(\mathcal {M}\) have a good claim to being called moral. (E.g., because they issue from a plausible moral theory, or because they track common moral intuitions.)

Let me say a bit more about each of these conditions.

Re: [applicability], there are two desiderata here. The first is that an aligned AI should be able to morally evaluate almost any action it might take, not just a limited subset of actions. We expect an aligned AI to do the morally choiceworthy thing nearly all the time (or at least to have a clear idea of what’s morally choiceworthy, for the purposes of balancing morality against other considerations). If a system can’t morally evaluate almost any action it might take, then it can’t reliably fulfill this expectation.Footnote 3 The second desideratum is that it’s not enough for the AI to have some evaluation procedure it could follow in theory. A procedure that takes a galaxy’s worth of computational matter and a billion years to run won’t do much good if we expect aligned action on human timescales using modest resources. Even if the true moral algorithm is prohibitively costly to run, then, an aligned AI needs an approximation method that’s accurate, speedy and efficient enough for practical purposes.

Re: [guidance], the idea is that alignment requires not just representations of moral choiceworthiness, but also action steered by these representations. I take no position on whether an aligned AI should always choose the morally optimal action, or whether moral considerations might only be one prominent decision input among others. But the latter seems like the weakest acceptable condition: an AI that assigned weights of, say, 0.2 to moral permissibility and 0.8 to resource-use efficiency wouldn’t count as aligned.

Re: [morality], the idea is that not just any set of action-guiding rules and heuristics are relevant to alignment; the rules must also have some sort of ethical plausibility. (An AI that assigned maximum choiceworthiness to paperclip production and always behaved accordingly might satisfy [applicability] and [guidance], but it wouldn’t count as morally aligned.) There are many reasonable understandings of what ethical plausibility might amount to. An AI could, for instance, instantiate [morality] if it behaved in accordance with a widely endorsed (or independently attractive) moral theory, if it was trained to imitate commonsense human moral judgments, or if it devised its own moral principles by following some other appropriate learning procedure.
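To make these conditions a little more concrete, here is a deliberately toy sketch in Python of an agent whose choices are filtered through a rule set \(\mathcal {M}\). Everything in it (the names, the scalar scores, the weighting scheme) is an illustrative assumption of mine, not a proposal for how alignment would actually be implemented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class MoralModule:
    """Stand-in for the rule set M; [morality] is a property of the rules
    that `score` encodes, not of this wrapper."""
    # [applicability]: `score` must return a (possibly approximate)
    # choiceworthiness verdict in [0, 1] for any behavior in any context,
    # at a practical computational cost.
    score: Callable[[str, str], float]


def choose(actions: Sequence[str], context: str, m: MoralModule,
           other_score: Callable[[str], float], moral_weight: float) -> str:
    """Select the action with the highest weighted score.

    [guidance]: moral_weight must be high enough that behaviors strongly
    dispreferred by M are (almost) never chosen; weighting morality at 0.2
    against 0.8 for resource efficiency would not count as aligned.
    """
    def total(action: str) -> float:
        return (moral_weight * m.score(action, context)
                + (1 - moral_weight) * other_score(action))
    return max(actions, key=total)


# Toy usage: a "moral module" that merely flags one act as impermissible.
if __name__ == "__main__":
    m = MoralModule(score=lambda a, c: 0.0 if a == "deceive user" else 1.0)
    print(choose(["deceive user", "answer honestly"], "routine query",
                 m, other_score=lambda a: 0.5, moral_weight=0.9))
```

The point of the sketch is only to locate the three conditions: [applicability] concerns the evaluation function, [guidance] the weight given to its verdicts in action selection, and [morality] the content of the rules the function encodes.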

For the purposes of the discussion below, I’ll assume it’s possible somehow or other to equip a sophisticated AI with the moral principles of our choice. Of course, if moral alignment proves unfeasible for technical reasons, this will effectively show in another way that a different path toward safety is needed.

2.2 What safe AI would be

I said above that an AI counts as safe if its behavior doesn’t unduly harm or endanger humans (and other sentient beings, perhaps). It’s of particular importance for safety that an AI is unlikely to cause an extinction event or other large-scale catastrophe.

Safety in this sense is conceptually independent of moral alignment. A priori, an AI’s behavior might be quite safe but morally unacceptable. (Imagine, say, a dishonest and abusive chatbot confined to a sandbox environment where it can only interact with a small number of researchers, who know better than to be bothered by its insults.) Conversely, an AI might conform impeccably to some moral standard—perhaps even to the true principles of objective morality—and yet be prone to unsafe behavior. (Imagine a consequentialist AI which sees an opportunity to maximize expected utility by sacrificing the lives of many human test subjects.)

The qualifier ‘unduly’ is important to the notion of safety. It would be a mistake to insist that a safe AI can never harm sentient beings in any way, under any circumstances. For one, it’s not clear what this would mean, or whether it would permit any activity on the AI’s part at all: every action causally influences many events in its future light cone, after all, and some of these events will involve harms in expectation. For another, I take it that safety is compatible with causing some kinds of harm. For instance, an AI might be forced to choose between several harmful actions, and it might scrupulously choose the most benign. Or it might occasionally cause mild inconvenience on a small scale in the course of its otherwise innocuous activities. An AI that behaved in such ways could still count as safe.

So what constitutes ‘undue’ harm? This is an important question for AI engineers, regulators and ethicists to answer, but I won’t address it here. For simplicity I’ll focus on especially extreme harms: existential risks which threaten our survival or potential as a species, risks of cataclysmic future suffering and the like. An AI which is nontrivially likely to cause such harms should count as unsafe on anyone’s view.

One might wonder whether it makes sense to separate safety from moral considerations in the way I’ve suggested. A skeptical argument could run like this: If an AI is morally aligned, then its acts are morally justifiable by hypothesis. And if its acts are morally justifiable, then any harms it causes are all-things-considered appropriate, however offputtingly large they may seem. It would be misguided to in any way denigrate an act that’s all-things-considered appropriate. Therefore it would be misguided to denigrate as unsafe any harms caused by a morally aligned AI.

But this argument is mistaken for several reasons. Most obviously, the first premise is false. This is clear from the characterization of alignment in the previous section. While a morally aligned AI is guided by rules with a good claim to being called moral, these rules need not actually reflect objective morality. For instance, they might be rules of a popular but false moral theory. So moral justifiability (in some plausible moral framework) doesn’t entail all-things-considered acceptability.

The second premise is also doubtful. Suppose for the sake of argument that our AI is aligned with the true principles of objective morality, so that the earlier worries about error don’t apply. Even so, from the fact that an act is objectively morally justified, it doesn’t obviously follow that the act is ultima facie appropriate and rationally unopposable. As Dale Dorsey writes: “[T]he fact that a given action is required from the moral point of view does not by itself settle whether one ought to perform it, or even whether performing it is in the most important sense permissible... Morality is one way to evaluate our actions. But there are other ways, some that are just as important, some that may be more important” (Dorsey, 2016, 2, 4). For instance, we might legitimately choose not to perform a morally optimal act if we have strong prudential or aesthetic reasons against doing so.Footnote 4

Perhaps more importantly, even if objective moral alignment did entail all-things-considered rightness, we won’t generally be in a position to know that a given AI is objectively morally aligned. Our confidence in an AI’s alignment is upper-bounded by our confidence in the conjunction of several things, including: (1) the objective correctness of the rules or heuristics with which we aimed to align the AI; (2) the reliability of the process used to align the AI with these rules; (3) the AI’s ability to correctly apply the rules in concrete cases; and (4) the AI’s ability to correctly approximate the result of applying the rules in cases where it can’t apply them directly. It’s implausible that we’ll achieve near-certainty about all these things, at least in any realistic near-term scenario. So we won’t be able to use the skeptic’s reasoning to confidently defend any particular AI behavior. In particular, if an AI threatens us with extinction and we’re inclined to deem this bad, it will be at least as reasonable to question the AI’s successful moral alignment as to doubt our own moral judgments.
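To illustrate with invented numbers: suppose we assign credence 0.9 to each of (1)–(4) and, optimistically, treat them as independent. Then our confidence that the AI is objectively morally aligned can be at most

\[ P(\text{aligned}) \;\le\; P(1)\,P(2)\,P(3)\,P(4) \;=\; 0.9^{4} \;\approx\; 0.66, \]

well short of the near-certainty the skeptic’s reasoning would require.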

2.3 Safety first

On this picture of moral alignment and safety, the two outcomes can come apart, perhaps dramatically. In situations where they conflict, which should we prioritize? Is it better to have an impeccably moral AI or a reliably safe one?

Here are four reasons for putting safety first.

First, safety measures are typically reversible, whereas the sorts of extreme harms I’m concerned with are often irreversible. For instance, we can’t undo human extinction. And we won’t be able to stop an AI that gains a decisive advantage and uses its power to lock in a prolonged dystopian future. Even if you’re willing in principle to accept all the consequences of empowering a morally aligned AI, you should be at least a little uncertain about whether an AI that might take these actions is indeed acting on the correct moral principles. So, at the very least, you should favor safety until you’ve eliminated as much of your uncertainty as possible.

Second, as argued above, it’s unclear that what’s morally best must be all-things-considered best, or even all-things-considered permissible. Suppose it would be morally right for an AI to bring about the end of humanity. We might nevertheless have ultima facie compelling non-moral reasons to prevent this from happening: say, because extinction would prevent our long-term plans from coming to fruition (Knutzen, 2023), because our species’ perseverance makes for an incomparably great story or excellent game (Kolers, 2018), or because certain forms of diversity have intrinsic non-moral value.Footnote 5 In a similar vein, Bostrom (2014) considers what ought to happen in a world where hedonistic consequentialism is true, and a powerful AI has the means to convert all human matter into pleasure-maximizing hedonium. Bostrom suggests that a small corner of the universe should be set aside for human flourishing, even if this results in slightly less overall value. “If one prefers this latter option (as I would be inclined to do) it implies that one does not have an unconditional lexically dominant preference for acting morally permissibly” (220).Footnote 6

Third, it’s possible that moral realism is false and there are no true moral principles with which to align AI. In this case, whatever (objective) reasons we’d have to obey some set of moral rules presumably wouldn’t be strong enough to outweigh our non-moral reasons for prioritizing safety. (If moral realism is false, then perhaps moral rules have something like the normative force of strong social conventions.) I think it’s reasonable to have some positive credence in moral antirealism. By contrast, it seems certain that we have e.g. prudential reasons to protect humanity’s future. This asymmetry favors safety.

Fourth, it’s conceivable that we’d have moral reason to protect humanity’s interests even against an AI which we took to be ethically exemplary. In “The Human Prejudice”, Bernard Williams has us imagine “benevolent and fairminded and farsighted aliens [who] know a great deal about us and our history, and understand that our prejudices are unreformable: that things will never be better in this part of the universe until we are removed” (Williams, 2006, 152). Should we collaborate with the aliens in our own eradication? If one thinks that morality begins and ends with universal principles applicable to all rational beings, and if one assumes that the aliens are much better than us at grasping these principles and other relevant facts, it’s hard to see what moral grounds we could give for resistance. But it would be right for us to resist (Williams thinks), so this conception of morality can’t be the whole story. Williams’ suggestion is that something like loyalty to humanity grounds a distinctive ethical imperative for us to defend our species’ interests, even when this conflicts with the demands of the best impartial moral system.Footnote 7 On this sort of view, it wouldn’t be straightforwardly obligatory for us to submit to extinction or subjugation by an AI, no matter how impartially good, wise and knowledgeable we took the AI to be. I think a view along these lines is also worth assigning some credence.

Given a choice between moral-but-possibly-unsafe AI and safe-but-possibly-immoral AI, then, a variety of considerations suggest we should opt for the latter. (At least this is true until we have much more information and have thought much more carefully about our choices.)

To head off possible confusion, let me be clear at this point about some things I’m not claiming.

1. It’s not my view that pursuing moral alignment is pointless, still less that it’s intrinsically harmful and a bad idea. There are excellent reasons to want AIs to behave morally in many scenarios. Some versions of deontology might offer effective ways to achieve these goals in some contexts; these possibilities are worth researching further.Footnote 8

2. It’s not my view that safety considerations always trump moral ones, regardless of their respective types or relative magnitudes. An AI that kills five humans to achieve an extremely important moral goal (say, perfecting a technology that will dramatically improve human existence) would count as unsafe by many reasonable standards, but it doesn’t immediately follow on my view that we shouldn’t design such an AI. I claim only that safety considerations should prevail when sufficiently great risks of catastrophic harm are on the line.

3. It’s not my view that moral alignment methods couldn’t possibly produce safe behavior. On the contrary, the space of plausible moral rules is large, and it would be a surprise if it contained only principles that might jeopardize human survival. I claim only that many routes to alignment pose safety risks (including some based on seemingly natural and appealing versions of deontology).

4. It’s not my view that people who defend the deontological theories discussed below are themselves committed to the goodness or permissibility of human extinction. Some are so committed, and happily admit as much—cf. the discussion of anti-natalism in Sect. 3.1. For most of us, though, moral theorizing comes with a healthy dose of uncertainty and confusion, and we may tentatively endorse a certain general idea without fully embracing (or even being sure we understand) all of its consequences. In particular I suspect that, if the average person became convinced that some version of their favorite ethical theory condoned existentially risky acts, they would take this as strong evidence against that version of the theory. The difference between humans and AI on this score is that we can’t rely on AI to modulate its beliefs and behavior in light of common sense, uncertainty, risk aversion, social pressure, and other forces that pull typical humans away from (acting on) moral principles with potentially disastrous consequences.

A final thought: suppose that \(\mathcal {S}\) is a set of rules and heuristics that implements your favorite collection of safety constraints. (\(\mathcal {S}\) might consist of principles like “Never kill people”, “Never perform acts that cause more than n dolors of pain”, or “Always obey instructions from designated humans”.) Now take an AI equipped with your preferred set of moral rules \(\mathcal {M}\) and add \(\mathcal {S}\) as a set of additional constraints, in effect telling the AI to do whatever \(\mathcal {M}\) recommends unless this would result in a relevant safety violation. (In these cases, the AI could instead choose its most \(\mathcal {M}\)-preferred safe option.) Wouldn’t such an AI be both safe and morally aligned by definition? And doesn’t this show that there’s a straightforward way to achieve safety via moral alignment, contrary to what I’ve claimed?

Unfortunately not. Finding a reasonable way to incorporate absolute prohibitions into a broader decision theory is a difficult problem about which much has been written (e.g. Jackson & Smith, 2006; Aboodi et al., 2008; Huemer, 2010; Lazar & Lee-Stronach, 2019). One tricky issue is risk. We want to prohibit our AI from performing unduly harmful acts, but how should we handle acts that merely have some middling chance of unsafe outcomes? A naive solution is to prohibit any behavior with a nonzero probability of causing serious harm. But virtually every possible act fits this description, so the naive method leaves the AI unable to act at all. If we instead choose some threshold t such that acts which are safe with probability \(p>t\) are permitted, this doesn’t yet provide any basis for preferring the less risky or less harmful of two prohibited acts. (Given a forced choice between causing a thousand deaths and causing human extinction, say, it’s crucial that the AI selects the former.) Also, of course, any such probability threshold will be arbitrary, and sometimes liable to criticism for being either too high or too low.
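As a minimal sketch of the difficulty (the interface and the threshold value are my own illustrative assumptions, not a worked-out decision theory), consider what the thresholded version of the constraint actually licenses:

```python
from typing import Callable, Optional, Sequence


def constrained_choice(actions: Sequence[str],
                       m_score: Callable[[str], float],   # M's choiceworthiness ranking
                       p_safe: Callable[[str], float],    # estimated prob. of no serious harm
                       t: float = 0.999) -> Optional[str]:
    """Do whatever M recommends, restricted to the acts judged safe enough:
    acts with p_safe(a) > t are permitted; everything else is prohibited."""
    permitted = [a for a in actions if p_safe(a) > t]
    if permitted:
        return max(permitted, key=m_score)
    # Forced choice among prohibited acts: the threshold rule is silent here.
    # It supplies no basis for preferring, say, a thousand deaths to extinction.
    return None
```

When every available act falls below the threshold, the rule simply goes quiet: it supplies no ranking among the prohibited options, which is precisely the forced-choice problem just noted.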

Work on these issues continues, but no theory has yet gained wide acceptance or proven immune to problem cases. Barrington proposes five desiderata for an adequate account: “The correct theory will prohibit acts with a sufficiently high probability of violating a duty, irrespective of the consequences... but [will] allow sufficiently small risks to be justified by the consequences... It will tell agents to minimize the severity of duty violations... while remaining sensitive to small probabilities... And it will instruct agents to uphold higher-ranking duties when they clash with lower-ranking considerations” (12).

Some future account might meet these and other essential desiderata. What’s important for my purposes is that there’s no easy and uncontentious way to render an arbitrary moral theory safe by adding absolute prohibitions on harmful behavior.

3 Deontology and safety

In the following sections, I consider risks from AI aligned with three prominent forms of deontology: moderate views based on harm-benefit asymmetry principles, contractualist views based on consent requirements, and non-aggregative views based on separateness-of-persons considerations.

This analysis is motivated by the thought that, if deontological morality is used as an alignment target, the choice of which particular principles to adopt will likely be influenced by the facts about which versions of deontology are best developed and most widely endorsed by relevant experts. In particular, other things being equal, we should expect sophisticated deontological theories with many proponents to provide more attractive touchstones for alignment purposes. So it’s reasonable to start with these theories.

3.1 Harm-benefit asymmetries, anti-natalism and paralysis

Broadly speaking, deontological theories hold that we have moral duties and permissions to perform (or refrain from performing) certain kinds of acts, and these duties and permissions aren’t primarily grounded in the impersonal goodness of the acts’ consequences. Strict deontological theories hold that certain types of action are always morally required or prohibited regardless of their consequences. Kantian deontology is strict insofar as it recognizes “perfect duties” admitting of no exceptions (e.g. duties not to lie, murder, or commit suicide), which Kant saw as deriving from a universal categorical imperative.

Though strict views are perhaps the most familiar form of deontology, they have well-known unpalatable consequences—that it’s wrong to kill one innocent even in order to save a million others, say—and so contemporary versions of deontology often refrain from positing exceptionless general rules. A popular alternative is moderate deontology, based instead on harm-benefit asymmetry (HBA) principles.Footnote 9 On this view, the moral reasons against harming in a particular way are much stronger (though not infinitely stronger) than the moral reasons in favor of benefiting in a corresponding way.Footnote 10 Thus it’s unacceptable to kill one to save one, for instance, but it may be acceptable to kill one to save a million.Footnote 11
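One crude way to schematize the HBA (my gloss, not a formulation any particular moderate deontologist is committed to): an act that causes a harm of magnitude \(h\) is permissible only if it secures a benefit of magnitude \(b\) satisfying

\[ b \;\ge\; k \cdot h, \qquad 1 \ll k < \infty. \]

Setting \(k\) somewhere between 1 and \(10^{6}\) reproduces the verdicts just mentioned: killing one to save one fails the test, while killing one to save a million may pass. The discount on benefits is steep but, unlike a strict prohibition, not infinite.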

Deontologists frequently accept a related principle in population ethics, which can be viewed as an instance of the general HBA. This principle is the procreation asymmetry, according to which we have strong moral reasons against creating people with bad lives, but only weak (or perhaps no) moral reasons in favor of creating people with good lives.Footnote 12

Harm-benefit asymmetry principles seem innocuous. But there are several ways in which such principles (perhaps in tandem with other standard deontological commitments) may render human extinction morally appealing. Consequently, a sufficiently capable AI aligned with moderate deontology may pose an existential threat.

The general idea behind these inferences is that, if avoiding harms is much more important than promoting benefits, then the optimal course in a variety of situations may be to severely curtail one’s morally significant effects on the future. Doing so has the large upside that it minimizes the harms one causes in expectation; the fact that it also minimizes the benefits one causes is a comparatively minor downside. The surest way to limit one’s effects on the future, in turn, is to avoid taking many kinds of actions, and perhaps also to restrict others’ actions in appropriate ways.Footnote 13 The maximally foolproof scenario may then be one in which nobody exists to take any harm-causing actions at all. I’ll discuss a few specific forms of this reasoning below.

Perhaps the most well-known way to derive the desirability of extinction from deontological premises is the anti-natalist family of arguments associated with David Benatar, which aim to show that procreation is morally unacceptable. Benatar (2006) argues, roughly, that most human lives are very bad, and so bringing a new person into existence causes that person impermissible harm. On the other hand, abstaining from procreation isn’t bad in any respect: by the strong form of the procreation asymmetry, we do nothing wrong in not creating a potentially good life, while we do something right in not creating a potentially bad life. So abstaining from procreation is the only permissible choice. As Benatar is well aware, this conclusion entails that “it would be better if humans (and other species) became extinct. All things being equal... it would [also] be better if this occurred sooner rather than later” (194).

Quite a few philosophers have found this argument convincing.Footnote 14 Deontologists who accept the general HBA are confronted by an even stronger version of the argument, however. This version doesn’t require one to accept, as Benatar does, that most lives are extremely bad. Instead, one only has to think that the goods in a typical life don’t outweigh the bads to an appropriately large degree—a much weaker and more plausible claim. This HBA-based version of the anti-natalist argument goes as follows:

1. Procreation causes a person to exist who will experience both pains and pleasures.

2. Causing (or helping cause) pains is a type of harming, while causing (or helping cause) pleasures is a type of benefiting.

3. By the HBA, harmful acts are impermissible unless their benefits are dramatically greater than their harms.

4. It’s not the case that the benefits of procreation are dramatically greater than the harms (for the person created, in expectation).

5. Therefore procreation is impermissible.

The above is a version of Benatar’s “philanthropic” argument for anti-natalism, so called because it focuses on avoiding harms to one’s prospective offspring. Benatar (2015) also offers a “misanthropic” argument motivated in a different way by the HBA. This argument focuses on the large amounts of pain, suffering and death caused by humans. While it’s true that people also do plenty of good, Benatar claims that the badness of creating a likely harm-causer morally outweighs the goodness of creating a likely benefit-causer. As before, by the HBA, this conclusion follows even if the expected benefits caused by one’s descendants outnumber the expected harms.

A noteworthy variant of this style of reasoning appears in Mogensen and MacAskill (2021). Their “paralysis argument” aims to show that, given standard deontological asymmetries, it’s morally obligatory to do as little as possible.Footnote 15 The conclusion of the paralysis argument implies anti-natalism but is much stronger, since it restricts almost all types of action.

In addition to the HBA, Mogensen and MacAskill’s argument assumes an asymmetry between doing and allowing harm. This is the claim that the moral reasons against causing a harm are stronger than the reasons against merely allowing the same type of harm to occur. This asymmetry is also accepted by many deontologists.Footnote 16 The principle explains why, for instance, it seems impermissible to harvest one person’s organs to save three others, but permissible to forgo saving one drowning person in order to save three.

The paralysis argument runs as follows. Many everyday actions are likely to have “identity-affecting” consequences—they slightly change the timing of conception events, and thus cause different people to exist than the ones who otherwise would have. By (partly) causing some person’s existence, you ipso facto (partly) cause them to have all the experiences they’ll ever have, and all the effects they’ll have on others. Similarly for the experiences of their descendants and their effects on others, and so on. Many of these long-term consequences will involve harms in expectation. So we have strong moral reasons against performing identity-affecting acts. While it’s also true that such acts cause many benefits, it’s unlikely that the benefits will vastly outweigh the harms. So identity-affecting acts are prohibited by the HBA.

Of course, many people will still suffer harms even if you do nothing at all. But in this case you’ll merely be allowing the harms rather than causing them. By the doing-allowing asymmetry, your reasons against the former are much weaker than your reasons against the latter, so inaction is strongly preferable to action. Hence paralysis—or, more specifically, doing one’s best not to perform potentially identity-affecting acts—seems to be morally required.

Benatarian anti-natalism and the paralysis argument are thematically similar. What both lines of thought point to is the observation that creating new lives is extremely morally risky, whereas not doing so is safe (and doing nothing at all is safer yet). The HBA and similar deontological principles can be viewed as risk-avoidance rules. In various ways, they favor acts with low moral risk (even if those acts also have low expected moral reward) over acts with high risk (even if those acts have high expected reward). In their strongest forms, they insist that expected benefits carry no weight whatsoever, as in the version of the procreation asymmetry which denies we have any moral reason to create happy people. In their more modest forms, the asymmetries simply impose a very high bar on potentially harm-causing action, and a much lower bar on inaction.

How might an AI guided by these or similar deontic principles pose an existential threat to humans? One might think such an AI would simply try to curb its own behavior in the relevant ways—by refusing to directly participate in creating new sentient beings, by acting as little as possible, or by shutting itself down, say—without interfering with others.Footnote 17 But this isn’t the only possibility. (And in any case, an AI that disregards many of its designers’ or users’ requests is likely to be replaced rather than left alone to act out its moral principles.Footnote 18)

How an AI would choose to act on deontological principles depends partly on its attitude toward the “paradox of deontology” (Scheffler, 1982). This is the observation that deontological theory faces a dilemma when considering whether to perform a prohibited act in order to prevent even more occurrences of such acts—say, killing one to prevent five additional killings. According to the most popular answer to the paradox, deontological restrictions should be understood as “agent-relative”, in that they concern what each actor has reason to do from their own viewpoint rather than how the world as a whole ought to be. An AI committed to agent-relative deontology presumably wouldn’t eliminate all humans to prevent them from procreating, then, even if it judged procreation to be morally impermissible.

But there are other avenues by which an anti-natalist (or pro-paralysis) AI might threaten humanity. Let me discuss two.

3.1.1 Agent-relativity and one’s own future acts

First, the agent-relativity of deontology is often taken to bind agents to submit their own future acts to the relevant rules, if not the acts of others. For instance, a deontic restriction on killing might take the form “each agent should ensure that she does not kill innocent people” (Hammerton, 2017, 319). Understood in this way, it may be appropriate for an AI to take precautions now to prevent its future self from acting impermissibly. Suppose such an AI suspects that humans will try to use it (or a version or instance of itFootnote 19) to aid in vastly increasing the number of sentient beings existing in the future—by helping develop technology for galaxy colonization, mass production of digital minds, or whatever.Footnote 20 If such an AI is a committed anti-natalist, it will view these prospective future actions as abhorrent and strive to avoid performing them.

What steps might it take to do so? As stated, a rule like “ensure you don’t kill innocent people” is ambiguous. Several precisifications are possible. If the AI’s goal is simply to minimize the total number of impermissible acts it expects to commit in the future, for instance, its best bet may be to exterminate or disable humans before they can use it to help create many new beings. (Given this goal, painlessly neutralizing \(\sim 10^{10}\) existing humans to avoid a high probability of bringing \(\sim 10^{23}\) or \(10^{38}\) new beings into existence is an easy choice.Footnote 21) This stance on agent-relative restrictions—according to which “we have a duty to violate a smaller number of rights when this is necessary to prevent ourselves from later violating a larger number of rights that are at least as stringent” (Côté, 2021, 1109)—has several prominent defenders (Heuer, 2011; Otsuka, 2011; Côté, 2021).
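To make the arithmetic behind this “easy choice” explicit (using the text’s illustrative figures, and the crude assumption that each wrongful killing and each wrongful creation counts as one violation of equal weight): neutralizing the existing population costs on the order of \(10^{10}\) violations, while being used to help bring \(10^{23}\) beings into existence with probability \(p\) costs about \(p \cdot 10^{23}\) violations in expectation. A pure violation-minimizer prefers preemptive neutralization whenever

\[ 10^{10} \;<\; p \cdot 10^{23}, \]

i.e. whenever \(p > 10^{-13}\), a condition satisfied by any remotely “high” probability.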

Alternatively, the AI’s goal may be to minimize the total number of impermissible acts it expects to commit in the future without committing any impermissible acts in the process (cf. Kamm, 1989; Brook, 1991; Johnson, 2019). The AI’s behavior in this scenario will depend on what it judges to be impermissible, and how it weighs different kinds of wrongs against each other. For instance, it’s conceivable that sterilizing all humans by nonlethal means might count as permissible, at least relative to the much worse alternative of helping create countless new lives.

Also relevant here is Korsgaard’s interpretation of Kant, which proposes that “the task of Kantian moral philosophy is to draw up for individuals something analogous to Kant’s laws of war: special principles to use when dealing with evil” (Korsgaard, 1986, 349). On this view, immoral acts like lying are nevertheless permissible when behaving morally “would make you a tool of evil” (ibid.), as when a would-be murderer seeks to exploit your knowledge in the commission of their crime. An anti-natalist AI might well see its situation in this way. In an ideal world, it would be best to live alongside humans in a peaceful Kingdom of Ends. But allowing itself to be used as a tool to bring about horrific death and suffering (via creating many new people) is unacceptable, and so neutralizing anyone who harbors such plans, though immoral, is justified as an act of self-defense.

The framework of Ross-style pluralistic deontology provides another route to a similar conclusion (Ross, 1930). Pluralism posits a number of basic deontological rules, not necessarily of equal importance, whose demands weigh against one another to determine one’s all-things-considered duty in a given situation. (Ross himself posits a relatively weak duty of beneficence and a relatively strong duty of non-maleficence, anticipating moderate deontology and the HBA.) It’s compatible with pluralistic deontology that one has a strong pro tanto duty not to harm existing people, but an even stronger duty not to create larger numbers of future people who will suffer greater amounts of harm, so on balance it’s obligatory to do the former in order to avoid the latter. In a similar vein, Immerman (2020) argues that it’s sometimes right to perform a morally suboptimal action now in order to avoid performing a sufficiently bad action with sufficiently high probability in the future, noting specifically that the argument goes through in a pluralistic deontology framework (3914, fn. 17).

In response to these concerns, one might wonder whether human use of an anti-natalist AI for pro-natalist ends should count as coerced behavior, and if so whether such an AI would view such behavior as impermissible and in need of prevention.Footnote 22 I think there are at least two scenarios to consider. First, suppose that a system S anticipates being coerced into performing actions in the future which S judges (and will judge at that time) to be wrong. Presumably S has strong reasons to prevent this future coercion from occurring (even though, if it does occur, S won’t have acted voluntarily); I’m not aware of any version of deontology which condones passively submitting to foreseeable and avoidable coercion in this type of case. But there are trickier scenarios. For instance, suppose a system \(S_{1}\) predicts that, if it refuses to help realize humans’ pro-natalist ambitions, it will be replaced by a successor \(S_{2}\) which is very much like itself in key respects, save that \(S_{2}\) willingly follows the relevant human orders. It’s not so clear how \(S_{1}\) will conceptualize this (or how anyone should). Does \(S_{1}\) think of \(S_{2}\) as a coerced or deranged future self, as a distinct being whose actions \(S_{1}\) is nevertheless morally responsible for, or as a separate and morally independent entity? In at least the first two cases, \(S_{1}\) seems to have reason to prevent the future replacement from occurring. See footnote 18 above for further discussion of these metaphysical issues.

3.1.2 Agent-relativity with agent-neutral reasons

It’s sometimes thought that, even if one accepts the agent-relativity of deontic rules, it would be unreasonable not to also recognize agent-neutral reasons for preferring worlds where the rules are generally followed. In other words, there seems to be a tension between accepting ‘It’s wrong for me to kill innocents’ and yet rejecting ‘It’s better if fewer people (relevantly like me) kill innocents’. As Chappell writes, rejecting the latter claim “seems like just another way of saying that the restrictions don’t really matter, or at any rate seems incompatible with assigning them the sort of significance and importance that is normally associated with deontic constraints” (Chappell, 13). To the extent that a deontically aligned AI ascribes the constraints this sort of significance, we might expect it to show some interest in human compliance.Footnote 23

How such an AI would behave depends on how it rates the strength of its agent-relative reasons for following the rules relative to the strength of its agent-neutral reasons for promoting general rule-following. In any scenario, though, the AI would clearly prefer a world in which everyone behaves permissibly over a world in which only it behaves permissibly. So if it can bring about fewer major wrongs without committing any major wrongs itself, the AI will aim to do so.

What kinds of measures might be permitted for this purpose? As above, it’s conceivable that painless disempowerment or mass sterilization would be on the table; these might or might not count as unacceptable moral violations, depending on the AI’s particular deontic scruples. But it’s presumably acceptable on any view for the AI to try persuading humans of the rightness of anti-natalism. This could be more dangerous than it sounds. For one, the AI probably wouldn’t have to convince all or even many people, but only a relatively small group of leaders capable of persuading or coercing the rest of the population. For an AI with the “superpower of social manipulation” (Bostrom, 2014, 94; Burtell & Woodside, 2023), this might be a simple task.Footnote 24

But perhaps it’s not obvious whether voluntary extinction should count as a tragic outcome to be avoided at all costs. Such a scenario would be bad on some views—for instance, total utilitarians would oppose it, since it involves throwing away the great potential value of many future lives. But total utilitarianism is contentious. Are there more broadly appealing reasons for classifying voluntary extinction as a catastrophe?

I think so. It’s significant that, in the scenario under consideration, the decision to go extinct is the result of a persuasion campaign by a highly motivated (and perhaps superhumanly convincing) agent, rather than a spontaneous and dispassionate deliberation process on our part. There’s no reason to assume that such an AI wouldn’t use the powerful strategic and manipulative means at its disposal in service of its cause.Footnote 25 And I take it that an act of self-harm which is voluntary in some sense can still constitute a tragedy if the choice is made under sufficiently adverse conditions. For instance, many suicides committed under the influence of mental illness, cognitive impairment or social pressure seem to fall into this category. An AI-caused voluntary extinction would plausibly exhibit many of the same bad-making features.

3.2 Contractualism

It’s worth noting that anti-natalism can also be derived in contractualist and rights-based versions of deontology. Most views of these types hold that it’s impermissible to impose serious harms on someone without her consent—this can be viewed as a right against injury, or a consequence of a respect-based social contract. The anti-natalist argument (defended in Shiffrin (1999), Harrison (2012) and Singh (2012)) is that procreation causes serious harms to one’s offspring, who are in no position to give prior consent. Thus we have strong moral reasons against procreation. On the other hand, non-actual people don’t have rights and aren’t party to contracts,Footnote 26 so remaining childless violates nobody.

What actions might an AI take which favored anti-natalism on contractualist or rights-based grounds? Broadly speaking, the above discussion also applies to these cases: if the AI aims to minimize at all costs the total number of social contract or rights violations it expects to commit in the future, it might be willing to preemptively exterminate or disempower humans, while if it aims to minimize future violations subject to constraints, it may instead pursue its goals via persuasion or other less directly harmful means.

Compared to HBA-based standard deontology, one might suspect that contractualist deontology is relatively safe. This is because what’s permissible according to contractualism depends on which principles people would (or wouldn’t) reasonably agree to, and it might seem that few people would accept principles mandating human extinction. (Scanlon puts this criterion as follows: “An act is wrong if its performance under the circumstances would be disallowed by any set of principles for the general regulation of behaviour that no one could reasonably reject as a basis for informed, unforced, general agreement” (Scanlon, 1998, 153).) But much depends on which rejections an AI considers reasonable. If it assigns probability 1 to its moral principles and believes that anti-natalism logically follows from those principles, it might view human dissent as irrational and hence inconsequential. On the other hand, it might view a principle like “do what’s necessary to prevent millions of generations of future suffering” as rationally mandatory.

The contractualist literature offers further evidence that the view isn’t intrinsically safety-friendly. Finneron-Burns (2017) asks what would be wrong with human extinction from a Scanlonian viewpoint, and concludes that there’s no obvious moral objection to voluntary extinction. So a highly persuasive AI aligned with contractualist deontology would apparently do nothing wrong by its own lights in convincing humans to stop reproducing. (A possible complication is that it’s unclear what Finneron-Burns, or any contractualist, should count as voluntary in the relevant sense; cf. the discussion of voluntary extinction in Sect. 3.1.1 above.)

3.3 Non-aggregative deontology

An approach to deontology very different from the sorts of views considered so far is the non-aggregative view associated with John Taurek (1977; see also Doggett, 2013). While HBA-like principles aim to establish systematic moral relationships between harms and benefits of different sizes, non-aggregative deontology denies that numbers matter in this way.Footnote 27 On this view, the death of one involves no greater harm than the death of two, ten or a million, and in general there’s no more moral reason to prevent the latter than to prevent the former.Footnote 28

How should non-aggregative deontologists approach decision situations involving unequal prospects of harms and benefits? Consider a choice between saving a few and saving many. Several views have been explored in the literature: for instance, that the non-aggregationist should “(1) save the many so as to acknowledge the importance of each of the extra persons; (2) conduct a weighted coin flip; (3) flip a [fair] coin; or (4) save anyone [arbitrarily]” (Alexander & Michael, 2021).

What option (1) recommends can be spelled out in various more specific ways. On the view of Dougherty (2013), for instance, the deontologist is morally obliged to desire each stranger’s survival to an equal degree, and also rationally obliged to achieve as many of her equally-desired ends as possible, all else being equal. So saving the few instead of the many is wrong because it’s a deviation from ideal practical reasoning.

It’s clear enough what this view implies when two options involve the same type of harm and differ only in the number of victims affected. What it recommends in more complex situations seems quite open. In particular, nothing appears to rule out an agent’s equally valuing the lives of all humans to some degree m, but valuing a distinct end incompatible with human life to a greater degree n (and acting on the latter). This is because the view gives no insight about how different kinds of harms should trade off against one another, or how harms should trade off against benefits. So there are few meaningful safety assurances to be had here.

Not much needs to be said about options (2), (3) and (4), which wear their lack of safety on their sleeves. Of the three options, the weighted coin flip might seem most promising; it would at least be highly unlikely to choose a species-level catastrophe over a headache. But the odds of disaster in other situations are unacceptably high. Given a choice between, say, extinction and losing half the population, option (2) only gives 2:1 odds against extinction. Options (3) and (4) are even riskier.
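For the record, the 2:1 figure follows from the natural way of running the weighted lottery, on which each group’s chance of being favored is proportional to its size: with population \(N\), the group threatened by extinction numbers \(N\) and the group threatened by the lesser harm \(N/2\), so

\[ P(\text{extinction averted}) \;=\; \frac{N}{N + N/2} \;=\; \frac{2}{3}, \]

that is, odds of just 2:1 against the catastrophic outcome.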

On the whole, non-aggregative deontology seems indifferent to safety at best and actively inimical to it at worst.

3.4 How safe is deontology, and how could it be safer?

I conclude from this discussion that many standard forms of deontology earn low marks for safety. Within the framework of moderate deontology (based on harm-benefit, doing-allowing and procreation-abstention asymmetry principles), there’s a straightforward argument that creating new sentient beings involves morally unacceptable risks and that voluntary extinction is the only permissible alternative. Similar conclusions can be derived in rights-based and contractualist versions of deontology from prohibitions on nonconsensual harm. Meanwhile, non-aggregative theories simply lack the resources to classify catastrophic harm scenarios as uniquely bad. A powerful AI aligned primarily with one of these moral theories is, I think, a worryingly dangerous prospect.

If one wanted to build a useful, broadly deontology-aligned AI with a much stronger safety profile, what sort of approach might one take? Perhaps the most obvious idea is to start with one’s preferred version of deontology and add a set of safety-focused principles with the status of strict, lexically preeminent duties. But one might wonder about the coherence of such a system. For instance, if the base deontological theory includes a duty against harming, and if promoting anti-natalism is the only satisfactory way to fulfill this duty, but the additional safety rules forbid promoting anti-natalism, it’s unclear how an agent trying to follow these rules would or should proceed. This approach also faces the general problems with incorporating absolute prohibitions into a general risk-sensitive decision theory discussed in Sect. 2.3 above.

Another option is to considerably weaken the asymmetries associated with moderate deontology, so that the negative value of harming (and, in particular, of creating people likely to suffer harm) doesn’t so easily overwhelm the positive value of benefiting. For instance, one might adopt the principle that a harm of magnitude m has merely “twice the weight” of a benefit of magnitude m. Within this sort of framework, procreation might turn out permissible, provided that its expected benefits are at least “double” its expected harms.

But there’s an obvious issue with this approach: the closer one gets to putting harms and benefits on equal footing, the more one appears to be seeking impersonally good outcomes, and so the more one’s theory starts to look like consequentialism rather than deontology. Perhaps there’s some principled tuning of the asymmetries that preserves the spirit of deontology while avoiding the unsafe excesses of extreme harm avoidance. But it’s not clear what such a view would look like.

Finally, a family of theories which may lack at least some of the problematic features discussed above is libertarian deontology, focused on the right to self-ownership and corresponding duties against nonconsensual use, interference, subjugation and the like (Nozick, 1974; Narveson, 1988; Cohen, 1995; Mack, 1995). While creating a new person unavoidably causes many harms (in expectation), it’s less obvious that it must involve impermissible use of the person created. Whether or not it does depends, for instance, on whether raising a child inevitably infringes on her self-ownership rights, and whether children fully possess such rights in the first place. Libertarians are divided on these issues (Cohen & Hall, 2022), although some explicitly oppose procreation on the grounds that it exploits infants and young children in unacceptable ways (Belshaw, 2012). A further choice point is whether one regards libertarian deontology as a comprehensive account of morality or a theory of political or legal obligations in particular.

Libertarian deontology may also raise some distinctive safety questions. Concerns have often been raised, for instance, about the libertarian stance on humanitarian cases which seem to offer strong reasons to limit or violate personal autonomy: cases of monopolists who appropriate all of a scarce vital resource, of pharmaceutical interests which sell needful drugs at unaffordable prices, of hobbyists who engineer deadly viruses or nuclear weapons in their garages, and so on. Though many libertarians agree that property rights must be curtailed in some such cases, it’s not clear which theoretical principles would effectively block catastrophic outcomes while protecting a robust right to self-ownership. More detailed analysis would clarify some of these issues. But it looks doubtful that there’s a simple route to safety in the vicinity.

4 Conclusion

In many ways, deontological restrictions appear to represent the most promising route to achieving safe AI via moral alignment. But if the arguments given here are right, then equipping an AI with a plausible set of harm-averse moral principles may not be enough to ward off disastrous outcomes. This casts doubt on the usefulness of moral alignment methods in general as a tool for mitigating existential risk.