1 Introduction

Since the advent of artificial intelligence (AI) research, some researchers have expressed concern that AI systems might become more intelligent than humans, with eventually fatal outcomes for humanity [1, 2]. In recent years, these worries have intensified [3, 4]. Perhaps the strongest piece of evidence of the growing concern is a statement released by the Center for AI Safety earlier this year. The statement consisted of only one sentence: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war” [5]. While many other researchers publicly disagreed, the statement was signed by a long list of researchers in AI and related fields, including the deep learning luminaries Geoffrey Hinton and Yoshua Bengio.

In this paper, I will evaluate approaches for dealing with catastrophic risks from AI. In Sect. 2, I will clarify which kinds of risks fall within the scope of the paper and why there is a need to address such risks. Subsequently, I will outline a list of desiderata for the evaluation of approaches for addressing catastrophic AI risks. In Sect. 3, I will use these desiderata to discuss approaches to catastrophic AI risks which are based on promoting scientific research. In Sect. 4, I will use the same desiderata to evaluate approaches to catastrophic AI risks which take the form of policy proposals. Section 5 will summarize the paper.

The strategy pursued in this paper is novel in that it (i) focuses on catastrophic risks, (ii) evaluates all approaches according to a consistent list of desiderata, (iii) combines considerations from AI ethics and from technical AI safety research and (iv) discusses how different approaches interact.

2 Catastrophic AI risks

2.1 What counts as a catastrophic AI risk?

This paper is about approaches aiming to reduce catastrophic AI risks. As I use the term, a catastrophic risk is the risk of an event, or series of events, which causes massive loss of human life, or something similarly bad. To have an arbitrary but precise target, I will stipulate that an event is catastrophic if and only if it involves the loss of at least 100 million human lives, or something of comparable moral importance.Footnote 1 Notice, however, that similar arguments to the ones below apply even with a less stringent notion of ‘catastrophic risk’. Human extinction would count as a catastrophe in this sense, but not all catastrophes involve extinction.

Since there is no fixed boundary in nature separating catastrophic and non-catastrophic AI risks, approaches to the latter might also be useful for the former. That being said, catastrophic risks also pose distinctive challenges. In particular, they require stringent anticipatory measures since we must take action against catastrophic risks before they happen. By contrast, if a risk is sufficiently small, it may be prudent to accept the risk and to only take action once the risk has actually materialized and can thus be understood better. As we will see, not every approach suited for dealing with catastrophic risks is helpful for non-catastrophic risks, and vice versa.

There are two types of risks which are potentially catastrophic, but outside of the scope of this paper, since they raise distinct issues.Footnote 2 Both concern risks to non-human beings. First, as AI advances and becomes more influential in society, this plausibly poses risks to non-human animals. For instance, it has recently been argued that AI contributes to the perpetuation of speciesist bias [10] and that it may further increase animal suffering in factory farming [11]. A full examination of catastrophic AI risks would need to examine whether AI poses catastrophic risks to non-human animals, and – if so – what to do about them. Second, some researchers have argued that future AI might be sentient, and thus capable of suffering [12,13,14]. Since AI systems can be copied easily, the possibility of AI suffering would arguably quickly lead to astronomically high suffering and thus a moral catastrophe.Footnote 3

In the next sub-section, I will explain why approaches to catastrophic AI risk are needed.

2.2 Why is the risk of AI catastrophe sufficiently high to warrant action?

By definition, an AI catastrophe would be very bad. This means that catastrophic AI risk can be high, i.e. severe, even if its probability is relatively low. For instance, suppose the chance of AI catastrophe is 0.1%. By definition, the least severe kind of AI catastrophe involves 100 million human deaths (or something of comparable moral significance). Hence, even given such a low chance of AI catastrophe and with respect to the least severe kind of catastrophe, 100,000 people would die in expectation. Consequently, an argument for taking AI catastrophe risk sufficiently seriously to warrant a decisive response does not need to show that a catastrophe is, in the absence of such a response, very likely, by ordinary standards.
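
Using the illustrative figures above (a 0.1% probability and the stipulated minimum of 100 million deaths), the expected-value arithmetic is simply:

$$\mathbb{E}[\text{deaths}] \;\geq\; 0.001 \times 100{,}000{,}000 \;=\; 100{,}000.$$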

While there is not much research specifically on catastrophic risks, some researchers have argued that AI poses an existential, e.g. extinction, risk [15,16,17,18].Footnote 4 These arguments typically rest on two key assumptions. First, while predicting technological progress is inherently hard, some available sources support the view that we will create AI which is generally more powerful than humans (let’s call this ‘AGI’) in the next few decades. A survey of expert AI researchers on the question, conducted in 2022, finds among respondents a mean forecast of a 50% chance that “unaided machines can accomplish every task better and more cheaply than human workers” by 2059 [21].

Taken at face value, this suggests that many AI researchers expect there to be AGI within the next decades. This picture is complicated by the fact that the answers heavily depend on the framing of these questions, suggesting that the survey respondents did not think carefully about all questions.Footnote 5 In addition, it is unclear to what extent AI experts are also experts on the long-term prediction of AI development.

Convergent with estimates which expect AGI in the next decades, Ajeya Cotra’s quantitative model, which is one of the most explicit models we have for predicting the advent of AGI, forecasts a 50% chance of ‘transformative’ AI by 2055, where a model counts as transformative if running many copies of that model (once trained) would have “at least as profound an impact on the world’s trajectory as the Industrial Revolution did” [23]. As discussed in Sect. 3.2, there is reason to be skeptical of these estimates. However, given that we do not have other credible sources and that even a small chance of AI catastrophe is very concerning, these estimates suffice to justify the need for academic discussion on such risks and possible societal responses.

Finally, while we may still be far from AGI, there is no agreed-upon theoretical reason to expect that AGI is impossible. Since scaling current models, i.e. increasing model size, the amount of training data and the amount of compute used, has led to the emergence of qualitatively new capabilities like few-shot learning [24, 25], it cannot even be confidently ruled out that further scaling might eventually lead to AGI. All this is highly uncertain. On both sides of the debate about the likelihood of AGI development, an increase in rigorous and peer-reviewed research would be desirable. However, in this state of uncertainty, I take academic discussion about precautionary societal responses to be necessary.

Second, it is argued that, once someone can build AGI, many entities might be able to build it. This is because computing power tends to get cheaper along a roughly exponential curve [26] and because many people may gain access to algorithmic improvements shortly after they are made. This magnifies two risks: First, malevolent actors might use AGI for their own purposes. Second, negligent actors might release AGI systems which are misaligned, i.e. don’t pursue the interests their designers want them to pursue, and begin to optimize for instrumental goals such as accumulating power. It seems plausible that AI systems which are more powerful than humans in many domains and can moreover easily be copied would have enormous destructive potential. So, even if the probability that such systems can be built in the foreseeable future is very low by ordinary standards, precautionary responses may nonetheless be necessary.

I take these considerations to be sufficient to warrant reflection on possible measures for decreasing catastrophic AI risks. Following the sentiment of Sætra and Danaher [27], I hold that this admission does nothing to detract from the importance of already existing harms caused by AI, such as algorithmic bias [28] or pollution [29, 30]. Concerns about already existing and anticipated AI harms are largely independent and may, in some cases, even support similar practical approaches.

I have included an approach to catastrophic AI risks in this paper if it strikes me as important and worthy of detailed attention, based on three (necessary but not sufficient) criteria: (i) whether the approach has been suggested or demanded by influential researchers, public figures, or institutions; (ii) whether researchers, policy makers, or institutions are trying to implement or further develop the approach; and (iii) whether the approach strikes me as broadly reasonable and potentially promising. The last factor does not mean that I think that the approach should be prioritized, but that reasonable people can endorse the approach and that it deserves further investigation. In other words, there can at least be reasonable disagreement on the merits of the approach.

2.3 Desiderata for approaches to catastrophic AI risks

In this section, I will outline four desiderata for an approach to catastrophic AI risks. The first is chance of success: the likelihood that the approach will be successful. There is some vagueness and ambiguity here. For research-based approaches, I take this to be the likelihood that the research actually achieves its aim, at least to a significant extent. For policy-based approaches, I take this to be the likelihood that the policy is actually implemented, at least in some form sufficiently similar to what the approach recommends. This desideratum does not address whether the research or policy is actually useful.

This is where the other desiderata come in. The next desideratum is the degree of beneficence. This is the likelihood that the approach, if successful (in the sense of the previous desideratum), will help to reduce catastrophic AI risks, and the degree to which the approach will help. In other words, the desideratum refers to the expected reduction of catastrophic AI risks by the approach, conditional on success.

The third desideratum is the degree of non-maleficence. This is the likelihood that trying the approach (that is, not conditional on success of the approach but also not conditional on failure) will cause harm, either by increasing catastrophic AI risks or some other way, and the degree to which the approach will cause harm. In other words, the desideratum refers to the expected harm of trying the approach.

The fourth desideratum is beneficent side effects. This is the likelihood that trying the approach (i.e., not conditional on success) will have positive effects other than reducing catastrophic AI risks, and the degree to which these effects are positive. In other words, the desideratum refers to the expected side benefits of trying the approach.

The motivation for this list of desiderata is clear. When evaluating an approach, we ultimately care about whether we should try to implement it. For assessing this, we need to know how likely it is that we will be successful in implementing the approach and how helpful the approach would be (in expectation) for reducing catastrophic AI risks, if implemented. However, we also need to take into account possible harms of trying to implement the approach and whether trying the approach might promote other values we care about. Importantly, possible effects of failing to implement and of successfully implementing the approach matter equally.

There are several features of this list of desiderata which deserve clarification or defense. First, the list is useful because it decomposes the question of whether an approach is good or useful into several more specific, more tractable questions. However, assessing to what extent a desideratum is fulfilled is obviously still challenging and something reasonable people can disagree about. Yet, the list at least helps to make the discussion more focused. To point to a more specific challenge: it can be the case that one desideratum outweighs the others. If an approach can be expected to be very harmful, satisfying the other three desiderata fully might not be enough to compensate. Practical, informal judgement must recognize these cases.
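
Purely for illustration, such judgements could be recorded on the 1-to-5 scale later used in Table 1. The sketch below is a hypothetical bookkeeping aid, not a decision procedure: it deliberately computes no weighted sum, mirroring the point just made that one desideratum (e.g., a very low non-maleficence score) can outweigh the others. The example scores are invented, not the paper's actual assessments.

```python
from dataclasses import dataclass

@dataclass
class Approach:
    """Qualitative desiderata scores on a 1-5 scale (5 = best)."""
    name: str
    chance_of_success: int        # likelihood the approach achieves its aim / gets implemented
    beneficence: int              # expected risk reduction, conditional on success
    non_maleficence: int          # 5 = little expected harm from trying the approach
    beneficent_side_effects: int  # expected benefits beyond catastrophic-risk reduction

    def summary(self) -> str:
        # No aggregate score is computed: aggregation is left to informal judgement,
        # but approaches with high expected harm are flagged for special scrutiny.
        flag = " [caution: high expected harm]" if self.non_maleficence <= 2 else ""
        return (f"{self.name}: success={self.chance_of_success}, "
                f"beneficence={self.beneficence}, non-maleficence={self.non_maleficence}, "
                f"side-effects={self.beneficent_side_effects}{flag}")

# Hypothetical example, not an entry from Table 1:
print(Approach("Example approach", 3, 4, 2, 3).summary())
```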

Second, all desiderata refer to the expected consequences of approaches. Thus, one might object that the list presupposes ethical consequentialism [29]. This is not the case, however. The list of desiderata is not intended to pick out every morally relevant consideration; it focuses on the most important considerations. While consequentialism might not be generally true, it is plausible that, with regard to approaches for reducing catastrophic AI risks, consequentialist considerations are most important. There are two reasons for this: First, none of the approaches we will consider threatens to violate any important non-consequentialist constraints, e.g. against killing people. Second, as argued previously, since there is a lot of wellbeing at stake in reducing catastrophic AI risks, consequentialist considerations assume a lot of weight.Footnote 6

I allow, however, that some people might, in accordance with a doing-allowing distinction [30], ascribe more weight to harms caused by an approach than to harms prevented by it. This possibility is another reason to distinguish the desiderata in this way.

Third, both the desideratum of degree of non-maleficence and that of beneficent side effects point to interactions between the question of how best to reduce catastrophic AI risks and other issues in AI ethics. Approaches to catastrophic risks are, all things being equal, preferable if they also help with other problems identified in AI ethics, or – at least – do not cause what AI ethicists not focused on catastrophic risks would identify as harms.

In the next two sections, we will put these desiderata to work. This will show that they are useful in evaluating approaches to catastrophic AI risks. Two further preliminary remarks: Since developments in AI are very fast and the debate on catastrophic risks relatively recent, many of the papers and ideas referred to are not (yet) peer-reviewed and have received little academic scrutiny so far. Thus, the reader should treat them with appropriate caution. Moreover, many of the issues concerned are uncertain and await further rigorous investigation. For this reason, it was unavoidable to make many controversial judgements on how big or likely specific effects of approaches are. It is virtually guaranteed that this evaluation will change significantly, given future developments in AI and policy as well as an increase in rigorous academic research on the topic. Readers may arrive at diverging judgements. However, they can nevertheless take the approach developed here as an example of how an investigation of approaches to catastrophic AI risks may proceed methodologically, and which considerations it should address.

In the next section, I will discuss research-based approaches to catastrophic AI risks.

3 Research-based approaches

3.1 Technical alignment research

The first research-based approach we will discuss is to reduce catastrophic AI risk by intensifying research on AI alignment. Let us stipulatively define the alignment problem as the problem of learning how to design AI systems such that they try to do what they ought to do. Equivalently, it is the problem of learning to design AI systems such that they pursue the goals they ought to pursue. This problem has two components: The ethical alignment problem consists in finding out which goals AI systems should pursue [31]. The technical problem is the problem of learning how to design systems such that they pursue the goals their designers want them to pursue [32, 33]. In the context of catastrophic risks, the technical alignment problem is more important, with some caveats mentioned later. It is unlikely that ethical reflection would conclude that an AI ought to try to produce a catastrophe, so it is unlikely that ethical misalignment is what causes a catastrophic outcome.Footnote 7 Instead, the main catastrophic risk from AI misalignment consists in AI systems which pursue goals different from the ones we, or more specifically their designers, tried to install in them. Hence, I focus on research on the technical alignment problem. In addition, I focus specifically on alignment research aimed at preventing anticipated catastrophes, rather than on existing harms (e.g. training ChatGPT to not produce biased text [34,35,36]).

To evaluate the chance of success of this approach, we first need to look at the current state of technical alignment research. Since there are many proposals and no clear agreement on the core ideas and methods,Footnote 8 this will remain a high-level, non-exhaustive overview. Moreover, given the current focus of alignment research and the probable locus of foreseeable catastrophic risk, I will limit myself to machine learning proposals.

Following Hubinger [38], we can distinguish alignment proposals according to their training goal and their training rationale. The training goal comprises the general sort of algorithm the model is intended to learn, and why learning that sort of algorithm is good. The training rationale explains why the training setup will lead the model to learn the training goal. Hubinger enumerates seven different training goals:

  1. Loss-minimizing agents: a model which has an internal representation of the training loss and explicitly aims to minimize it.

  2. Fully aligned agents: a model which shares the goals of humans, and optimizes for them.Footnote 9

  3. Corrigible agents: a model which allows our attempts to interfere with its behavior, including changing its programming or turning it off.

  4. Myopic agents: a model which isn’t optimizing any sort of coherent long-term goal. Rather, its goals are limited in some way.

  5. Simulators: a model which only simulates some other process, not an agent following goals.

  6. Narrow agents: a model which has a high degree of capability in a narrow range of domains, but severely constrained capability in all other domains.

  7. Truthful question-answerers: a model which exclusively truthfully reports the contents of its internal world representation, not an agent following goals.

Before briefly discussing these approaches, let us also look at training rationales. I divide the rationales mentioned by Hubinger into four classes:

  A) Capability limitations: analyzing whether a model would actually develop the capabilities to cause an undesirable or catastrophic outcome (e.g. deception).

  B) Predicting model properties in advance (e.g., via inductive bias analysis, including loss landscape analysis, or via game-theoretic or evolutionary analysis): aiming to predict what sort of model, e.g. an agent or a simulator, a particular training process would produce.

  C) Transparency and interpretability: analyzing the model’s internal processes to understand what representations and algorithms it is learning.

  D) Automated oversight: using AI models to oversee the training of other models, e.g. by providing feedback.

The training goals embody contrasting ideas regarding the specification of a model which avoids catastrophe. Some focus on the model not being an agent [41], others on it having the right goals or some other feature, e.g. being domain-narrow or corrigible, which makes it more harmless and easier to control. In the service of reducing risk, training rationales focus on predicting the properties of the model, limiting its capabilities, making it interpretable, or using AI to improve the training.

I will note here some important challenges for these approaches. Interpretability research (e.g., [42,43,44]) and inductive bias analysis are, with respect to cutting-edge models, still in their early stages. With automated oversight, which is the alignment proposal emphasized by OpenAI [45], it is not clear how exactly the method would be implemented, and there are many doubts about whether it is possible to use a model for doing alignment research without losing control over that model [46]. Currently, alignment research is based on reinforcement learning from human feedback (RLHF) [47, 48]. However, this method has fundamental limitations, so it will almost certainly break down with more advanced models [49]. Besides, many of the above training goals and rationales might lead to a model which is significantly less powerful than models which do not obey this, or any, alignment proposal. In that case, there are commercial and political incentives to build the models which do not satisfy the proposal.
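
Since RLHF plays a central role here, a minimal sketch of its reward-modelling step may be useful. The snippet below trains a toy reward model on pairwise human preference data using the standard pairwise (Bradley-Terry-style) loss; the architecture, data, and dimensions are placeholders for illustration, not those of any deployed system.

```python
import torch
import torch.nn as nn

# Toy reward model: maps an (already embedded) response to a scalar reward.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Placeholder preference data: embeddings of responses humans preferred ("chosen")
# and dispreferred ("rejected"). In real RLHF these come from human comparisons.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise preference loss: push the reward of the chosen response above
    # that of the rejected one (-log sigmoid of the reward margin).
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model would then supply the reward signal for a policy
# optimization step (e.g., PPO) on the language model, which is omitted here.
```

As the text notes below, a model optimized against such a learned reward is incentivized to produce outputs that raters approve of, which is not the same as outputs that are true or safe.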

The upshot is that estimating the chance of success of alignment research is very hard. The field is in its infancy and still largely in an exploratory phase. More work is required to better understand the precise shape of the problem and the space of possible solutions. In addition, predicting the speed and form of scientific progress is notoriously difficult. That being said, the problem seems, prima facie, hard to solve, and it remains an open question whether it is solvable at all.

Alignment research has a very high degree of beneficence. Conditional on the development of sufficiently powerful AI, solving the alignment problem seems to be a necessary condition for avoiding catastrophe. For suppose models are built which are capable of long-term strategic planning in many different domains and control critical (e.g. military) infrastructure. At a certain level of capability or influence, these systems could cause enormous damage if they pursue the wrong goals. Alignment research is necessary to ensure that AI systems do not pursue dangerous goals. The only scenarios where alignment research would not be necessary are if (a) sufficiently powerful AI is never developed or (b) the alignment problem turns out to be trivial in the sense that no dedicated ‘alignment research’ is needed for reliably building aligned AI.Footnote 10

Alignment research also has high beneficent side effects. Even in the absence of catastrophes, alignment research dedicated to reducing AI catastrophe risk may be useful for reliably training models to show desirable behavior. For instance, RLHF – an alignment technique – is used to reduce biased and confidently false statements (hallucinations) in large language models (LLMs) like GPT-4. In general, learning to design a model such that it pursues exactly the goals you want it to pursue might be necessary to maximize the societal utility of AI.

However, alignment research has a low degree of non-maleficence. There are two categories of concerns. First, aligning a model means designing it such that it acts in accordance with human goals. Having perfectly aligned AI arguably increases the risk that it will be intentionally misused since it makes it easier for designers to use the system to further their own goals [50]. This also intensifies pre-existing concerns about power concentration in the hands of AI companies [51] or governments.

Second, alignment might speed up progress in AI research and thus lead to a faster deployment of dangerous models. This could be either because aligned systems are more useful or because alignment research can let dangerous systems appear harmless. With respect to the first worry, research on alignment and research on making models more powerful are not always cleanly separable. For instance, the fine-tuning via RLHF used to align ChatGPT makes it better at many kinds of tasks, e.g. giving correct answers. Since aligned systems are more useful, there are stronger incentives to accelerate their development and deploy them recklessly.

Moving on to the second worry, alignment research might cause us to believe that models are harmless even if they are dangerous. In particular, certain alignment paradigms, like training models such that they get positive human feedback via RLHF, might train models to appear aligned while being misaligned. For training via human feedback incentivizes models to say what humans believe is true, rather than what is true, and to hide behavior humans would disapprove of [34, 52]. In the worst case, this might lead people to think a model is harmless, deploy it and cause a catastrophe.

While the possible harms of alignment research are important, several considerations suggest that they are probably surmountable, if handled carefully. If one proceeds responsibly, alignment research can be directed specifically at avenues which are not expected to have an offsetting effect on capabilities. Similarly, one ought to guard against training models to appear aligned without sufficient certainty that they actually are aligned and safe. Risks from misusing AI are very important, but they need to be approached with utmost care in any scenario in which AI capabilities continue to advance.

3.2 AI timelines research

The goal of AI timelines research is to better predict how AI will develop in the future. The most common type of research question is when cutting-edge models will reach a certain level of capability. Particular attention has been devoted to the question of when “transformative AI” will be built [53]. Recall that a model is considered transformative if running many copies of that model (once trained) would have “at least as profound an impact on the world’s trajectory as the Industrial Revolution did” [21]. However, not all work focuses on transformative AI, construed in this way [54].

Similar to alignment research, timelines research is very hard. It is plausible that timelines research is even less likely to come to clear and precise results, since AI timelines depend on societal processes which are partially chaotic, e.g. the outbreak of wars, and on the speed of scientific progress, which is, to some extent, inherently unpredictable. To date, quantitative models of AI timelines proceed by extrapolating trends in the availability of compute [21, 54, 55].Footnote 11
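
To make concrete what such extrapolation involves, here is a deliberately simplified sketch. The growth rate, current compute figure, and "transformative" threshold are arbitrary placeholders chosen for illustration, not estimates taken from the cited models.

```python
import math

# Toy trend extrapolation: assume the training compute of cutting-edge models
# grows exponentially and ask when it crosses a stipulated threshold.
# All numbers are illustrative assumptions, not empirical estimates.
current_year = 2023
current_compute_flop = 1e25   # assumed compute of today's largest training runs
annual_growth_factor = 3.0    # assumed year-on-year growth of training compute
threshold_flop = 1e32         # stipulated compute level deemed 'transformative'

years_needed = math.log(threshold_flop / current_compute_flop) / math.log(annual_growth_factor)
print(f"Threshold crossed around {current_year + years_needed:.0f}")
```

The forecast is only as good as the assumed growth rate and, crucially, the assumed threshold – exactly the kind of unverifiable assumption discussed next.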

However, the results of this method rest on assumptions which cannot be verified. Many researchers deny that the availability of compute is the key bottleneck to achieving transformative AI or AGI; they think that current models are the wrong type of thing to become AGI [56,57,58,59,60]. According to this view, scientific breakthroughs, not (just) more compute (and data), are required. Moreover, even if more compute is the key ingredient for AGI, it is very unclear how much compute is necessary, and there is no reliable method of answering this question (except building AGI). Hence, the chance of success of timelines research is low.

Its degree of beneficence is moderate. Knowledge about AI timelines is relevant for many kinds of decisions. First, how urgent it is to deal with AI in general, and with its specific risks and benefits, depends on timelines. All other things being equal, dealing with AI should be prioritized more if timelines are shorter. Second, specific decisions are sensitive to timelines. For instance, alignment proposals which assume that AGI will be continuous with current methods are more justified given shorter timelines, while proposals which depend on a large foundational research project that still needs to be carried out require longer timelines.

So, knowledge about timelines can inform approaches needed to remedy catastrophic AI risks. At the same time, knowing AI timelines in itself does nothing to reduce such risks. Its value is limited to informing other approaches.

The potential harm from timelines research appears low. If one believes that speeding up AI progress is bad (e.g., because of catastrophic risks), then timelines research might be harmful by convincing people that AGI is feasible in the not-too-distant future: if timelines research results in the view that AGI is possible soon, this might incentivize people to invest more resources into AI research. However, even if this scenario comes about, I estimate the effect to be small, partially because increasing attention to AI seems inevitable either way.

In terms of beneficent side effects, one may hope that timelines research also informs decisions not related to catastrophic risks. However, since most other issues in AI ethics have a shorter time horizon, I think timelines research has low beneficent side effects.

3.3 Research on AI policy

If there are catastrophic AI risks, it is likely that they will ultimately require a political response. Not only that, but political regulation of AI is already being designed (e.g. the Artificial Intelligence Act in the EU). It is not clear which parts of it target catastrophic AI risks as opposed to existing AI harms, but catastrophic risks, up to and including extinction, have at least been mentioned by members of the UK and US governments [61]. At the same time, there is a lot of uncertainty about which policies would be effective and proportionate in reducing catastrophic AI risk. In this situation, there seems to be an important role for research on AI policy.

Since we will discuss concrete policy ideas later, I will not recapitulate these proposals here (for an overview of recent AI governance research, see Maas [62]). Instead, I will rely on general considerations which help to narrow down how AI policy research relates to our desiderata. More than with alignment and timelines research, success in AI policy research comes in degrees. A partial success can consist, e.g., in understanding the risks and benefits of a specific policy better, even if the general question – what the best combination of policies to reduce catastrophic AI risks is – remains open. Thus, some progress seems possible.

At the same time, using the strict standard of success we have applied to the previous approaches, the chance of arriving at a confident, true verdict regarding the best combination of policies for catastrophic AI risks is relatively low. In policy research on complex topics, it is typically hard to come up with clear and evidence-based answers to the questions of what effects a policy will have, how different policies and other factors interact, and how to evaluate these effects. Typically, some disagreement persists.

The degree of beneficence of AI policy research is high. Without systematic research on policy for reducing catastrophic AI risk, reducing such risks is significantly less likely. For without systematically evaluating policies, we could only reduce catastrophic risk if (a) such risks don’t require a political response, (b) the correct political response is obvious or can otherwise be identified without dedicated research, or (c) the correct political response is chosen by random luck. Since these scenarios, even taken together, seem relatively unlikely (at least to me), AI policy research is probably indispensable for tackling catastrophic AI risks.

AI policy research scores very high on degree of non-maleficence since it is unclear how it could, in expectation, cause harm. Obviously, the ideas resulting from such research might be bad, but it seems plausible that ideas informed by research have a higher chance of being adequate than ideas which are not. The approach has moderate beneficent side effects: similar to alignment research, research on policies aimed at reducing catastrophic AI risks may also fruitfully inform policy on AI which does not involve catastrophic risks.

In the next section, we will consider policy-based approaches to catastrophic AI risk.

4 Policy-based approaches

4.1 Halting AI research

The first approach is to embrace policies to halt, or slow down, cutting-edge AI research in general. It is not exactly clear what specifically qualifies as cutting-edge research but, at present, this would be research on large foundation models like GPT-4. A model is cutting-edge, in the relevant sense, when it belongs to the most capable models currently available. Most interesting, from the standpoint of catastrophic risks, are models which are either very capable in a wide range of domains, like LLMs, or very capable in a specific domain relevant to catastrophic risks, e.g. military applications. Currently, general models are more important because they might pose hard-to-anticipate risks in a variety of domains.

The approach can consist either in halting AI progress or in slowing it down. Moreover, such a measure can be either temporary or permanent. While we treat this here as one unified approach, following the same core rationale, it is important to explore differences between different versions of the approach as well.

Overall, I take the chance of success of this approach to be moderate. There are strong incentives to continue with cutting-edge AI research. The leading companies have already invested a lot of money in AI research and desire a return on their investment. While governments are sensitive to a variety of interests, they are also influenced by the lobbying efforts of important companies. More importantly still, governments aim to avoid falling behind in economic and military competition with other countries. To the extent that cutting-edge AI is important for economic and military capability, governments have strong incentives to push forward, rather than slow down, AI development.

Internationally, the situation may be a coordination problem. Suppose everyone agreed that all countries halting AI progress would be a better outcome for each of them than no one halting AI progress. Nevertheless, a single country could gain an advantage by defecting from an agreement to halt AI research, reaping the benefits of AI advances, and using them in competition with other countries. Hence, such coordination may be hard to achieve.

An additional challenge, specific to AI, is that progress in cutting-edge AI is to a significant extent driven by decreases in the price of compute, which leads to a higher availability of compute for training cutting-edge models [54, 55, 60]. This is a problem because the decreasing cost of compute is independent of dedicated AI research efforts. Thus, even if no dedicated AI research is conducted, companies and governments will likely be capable of building more powerful AI models just by using more compute.

But there are also factors supporting optimism about the ability to slow down or halt AI research. As evidenced by an open letter in 2023 [64], there are many researchers, entrepreneurs and publicly influential personalities who support at least a temporary ban on the most disruptive forms of cutting-edge AI research. In addition, there is some indication that the public is critical of AI progress: in a recent poll by the Artificial Intelligence Policy Institute, 62% of respondents were somewhat or mostly concerned about growth in AI, while only 21% were somewhat or mostly excited [65]. So, it is plausible that a strong coalition aiming to slow down or halt AI progress might emerge. It seems clear that crafting regulation to temporarily slow down cutting-edge AI is more feasible than regulation to permanently halt it.

The approach is very highly beneficent. Permanently stopping cutting-edge AI research may eliminate all catastrophic risks from AI. The usefulness of temporarily pausing, or slowing down, AI research is still significant [66]. Such a moderate measure provides politics, society, and research (esp. alignment research) with more time to respond and adapt to technological changes. Furthermore, it affords time to monitor released models for safety risks and eliminate those risks before more advanced models are deployed. Having time to understand model capacities and risks before building even more powerful models would be very helpful for safety. Also, to the extent that alignment and policy research are central, having less time pressure for such research would be very desirable. Finally, enacting and monitoring a pause in cutting-edge AI research may contribute to learning how to coordinate around AI in the future. This experience may be valuable if certain future models are discovered to pose high risks of catastrophe and thus need to be banned permanently and across the board.

However, the approach poses a high risk of harms. First, to the extent that AI development is beneficent, e.g. by boosting economic growth and scientific research, the approach is harmful because it prevents or delays the benefits of AI progress. Second, the actors most likely to implement and adhere to a moratorium on AI research are probably those most responsive to risks from AI, and thus those that tend to be more responsible in this regard. If AI research is slowed down through regulation, but the regulation is not comprehensively adopted and enforced, this may enable the most reckless actors, e.g. certain nations, to gain a lead in AI research and empower them to build cutting-edge AI. This would be particularly bad for catastrophic (and other) risks.

Third, regulation to slow down or temporarily halt AI research could make technical progress more discontinuous. As mentioned above, AI progress is chiefly driven by the increased availability of compute. Now suppose designing more advanced models is prohibited for ten years. After ten years, there will be much more compute available. The increased availability of compute might lead to a rapid, discontinuous jump in the capabilities of cutting-edge AI models, which poses specific risks. If AI progress is continuous, we can monitor each generation of models for signs that they are starting to develop dangerous properties and, if so, respond. By contrast, if progress is discontinuous, a range of dangerous properties might suddenly appear without prior warning, making a catastrophe more likely.
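
To make the worry vivid with a stipulated figure (the doubling time is an illustrative assumption, not an empirical estimate): if the compute available per dollar doubles every two years, then over a ten-year pause the compute affordable at a fixed budget grows by a factor of

$$2^{10/2} = 2^{5} = 32,$$

so the first models trained after the pause could draw on far more compute than the last models trained before it.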

Lastly, the approach has very high beneficent side effects. It seems plausible that new non-catastrophic risks from AI derive to a significant extent from progress in cutting-edge models which leads to a broader influence of AI in society, e.g., in making important decisions about people’s lives [67], and causes new challenges, e.g. the automation of jobs [68]. For the same reasons mentioned earlier, slowing down or halting AI research can prevent, or help mitigate, such risks.

4.2 Compute governance

A variety of concrete policy proposals to regulate AI with respect to catastrophic risks have recently been made [17, 69, 70]. Many of these proposals are more specific than the high-level approaches discussed so far. Due to constraints of space, it is not possible to evaluate, or even mention, most of them here. Yet, I will talk about compute governance as a general approach which is beginning to be fleshed out with specific policy proposals. Compute governance aims to influence, or direct, AI development and deployment by controlling and governing access to computational resources.

I will focus on two policy ideas mentioned by Muehlhauser [69]. They are by no means the only policy ideas falling under the label of compute governance worth thinking about, but reflecting on them is a good basis for further research. First, software export controls can be used to control the export of cutting-edge AI models to other countries (or even entities within a country). What counts as a cutting-edge model could be operationalized by the amount of compute (measured in US dollars) used to train it. I count this proposal as an instance of compute governance since it defines the relevant software in terms of the compute used for developing it. Second, cutting-edge computer chips, i.e. chips over a certain capability threshold, could be tracked, and governments could require a license to bring together large numbers of them.

I estimate that these two policy proposals have a high chance of success, i.e. implementation. In contrast to halting or slowing AI research, they are compatible with commercial, military and other incentives which favor fast AI development. Moreover, since the number of hardware chips can be measured and tracked precisely, these policies can be operationalized well.
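
As a purely hypothetical illustration of such operationalization, the sketch below classifies planned training runs against a stipulated compute threshold. The threshold value, the function name and the example figures are invented for the example and do not reflect any actual or proposed regulation.

```python
# Hypothetical compute-governance check: flag training runs above a stipulated
# compute threshold as requiring a license. The threshold is an arbitrary
# illustrative value, not one drawn from any existing policy.
LICENSE_THRESHOLD_FLOP = 1e26

def requires_license(training_compute_flop: float) -> bool:
    """Return True if the planned training run exceeds the stipulated threshold."""
    return training_compute_flop >= LICENSE_THRESHOLD_FLOP

planned_runs = {
    "small fine-tuning run": 1e21,
    "frontier-scale pretraining run": 3e26,
}
for name, flop in planned_runs.items():
    status = "license required" if requires_license(flop) else "no license required"
    print(f"{name}: {status}")
```

The limitation discussed below applies directly to any such fixed threshold: as compute gets cheaper, the threshold either captures ever more ordinary systems or misses genuinely dangerous amounts of compute.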

The approach is moderately beneficent. It represents an important angle for regulating AI in the medium term. Without software export controls and the tracking of cutting-edge chips, the most capable models proliferate, and governments are blind to potentially dangerous clusters of compute. Dangerous developments might happen in spheres which governments cannot perceive or intervene on. In these circumstances, the chances for effective regulation of AI are severely diminished. Hence, the approach has an important role in supporting and enabling other AI regulation efforts.

Yet, its long-term potential is limited. So far, compute has been getting cheaper along a roughly exponential path [24]. This means that compute clusters which are now considered massive should eventually be expected to fall behind the computing power of an ordinary desktop computer of the future. If that happens, the two policies above will either have to apply to every computer without exception or they will neglect compute clusters which could lead to the emergence of dangerous models. If the policies are supposed to apply to every computer, they are harder to implement.

More importantly, though, these policies lose their specific usefulness once they can no longer distinguish between harmless and dangerous amounts of computing power because every single computer commands a dangerous amount. If some intervention is taken on the basis of this classification, e.g. requiring a license for handling dangerous amounts of compute, it will no longer be a targeted intervention but a large-scale effort to slow down or halt AI use and research.

The approach appears moderately high on non-maleficence. Restricting exports of cutting-edge chips and requiring licenses for big compute clusters may moderately slow down economic growth, but not to an extent remotely comparable with a general moratorium on AI. Tracking compute clusters is a form of surveillance. However, since no personalized, private information needs to be transmitted, this form of surveillance seems relatively innocuous. While compute governance proposals along the lines above are highly non-maleficent, specific measures taken on their basis, e.g. prohibiting training runs using more than a certain amount of compute, can of course be harmful (or very helpful).

Finally, the approach has moderate beneficent side effects. The approach can also be used to tackle non-catastrophic risks from AI, as long as these risks tend to arise from the most capable models (which tend to require the most compute). Since the long-term viability of the approach is questionable, because over time models with dangerous capabilities will proliferate due to decreasing costs of compute, one may argue that the approach is even better suited to dealing with non-catastrophic risks which tend to have a shorter time horizon.

5 Conclusion

In this paper, I have provided reasons to take putative catastrophic risks from AI sufficiently seriously to begin reflecting on potential responses. Subsequently, I have evaluated five approaches to catastrophic risks from AI based on a list of four desiderata. For illustration, I have summarized the qualitative assessment presented in the previous sections in Table 1 using numbers.

Table 1 For illustration, this table depicts the five approaches discussed in the main text and how they may score on each of the four desiderata on a scale from 1 to 5

These numbers are translations from the qualitative evaluations presented earlier: 5/5 equals very high, 4/5 equals (moderately) high, 3/5 equals moderate, 2/5 equals (moderately) low and 1/5 equals very low. Note that high uncertainty regarding the correct score has been translated to a moderate score.

As such, these numbers are in no way more objective or reliable than the qualitative evaluations they are derived from. They come with a high degree of uncertainty because they concern approaches to anticipated risks, whose degree and nature are hard to estimate. Moreover, scientific research on both the kind and degree of catastrophic AI risks as well as possible responses is in very early stages. This is also the reason why many of the papers I cited are not peer-reviewed and should be treated with appropriate care. It is to be expected that this evaluation will change significantly, given future developments in AI and policy as well as an increase in rigorous academic research on the topic.

Finally, this table is not supposed to embody a ranking of these approaches. While some approaches might be more important or beneficial overall, all of them have distinctive potentials and risks. Particularly, as I have emphasized throughout, they often complement or even depend on each other. Pausing AI research, for instance, is only useful when other approaches are pursued during the pause. These potentials, risks and interactions need to be considered when adopting an approach, or a combination of approaches.

I have to emphasize that more research is needed both on evaluating the high-level approaches I have discussed and on generating detailed research and policy proposals in line with these approaches. (I also note that some relevant contributions to the literature have not been mentioned in this paper because they appeared while this paper was undergoing peer review.) Apart from that, due to constraints of space, I have left out many promising policy approaches (e.g., [66, 67]). In particular, there are many approaches which I estimate would be low on beneficence but very high on non-maleficence. While not decisive if considered alone, such policy approaches may constitute clear improvements which lead to marginal reductions in catastrophic risk, without much potential for harm.

Despite the call for more research, two lessons are suggested by this investigation: First, the effects of particular catastrophic-risk approaches on present-day AI harms are complex, diverse, and need to be examined on a case-by-case basis. In some instances, approaches which help with catastrophic risks are similarly beneficial for existing harms. These cases especially allow for cooperation between people warning about anticipated risks and researchers focused on present harms. However, certain approaches – both at a high level and when conceptualized as detailed policy proposals – are potentially effective against catastrophic risks but not against present harms, or vice versa. In these cases, compromises need to be made. Moreover, due to these cases, estimating the likelihood and severity of catastrophic AI risks and present-day harms is an important target for future research.

Second, there are many approaches available which are suitable for reducing catastrophic AI risks. While possible harms and other effects also need to be considered, we are beginning to build a toolset which can be used to respond adequately to catastrophic risks from AI.