1 Introduction

In decision theory, there is a large debate about how an instrumentally rational agent, i.e., an agent trying to achieve some goal or maximize some utility function, should decide in Newcomb’s problem (introduced by Nozick 1969) and variations thereof (a list is given by Ledwig 2000, pp. 80–87). Consequently, different normative theories of instrumental rationality have been developed. The best known ones are evidential (sometimes also called Bayesian) decision theory (EDT) (Ahmed 2014; Almond 2010; Price 1986; Horgan 1981) and causal decision theory (CDT) (Gibbard and Harper 1981; Joyce 1999; Lewis 1981; Skyrms 1982; Weirich 2016), but many have attempted to remediate what they view as failures of the two theories by proposing further alternatives (Spohn 2003, 2012; Poellinger 2013; Arntzenius 2008; Gustafsson 2011; Wedgwood 2013; Dohrn 2015; Price 2012; Soares and Levinstein 2017).

Because the main goal of artificial intelligence is to build machines that make instrumentally rational decisions (Russell and Norvig 2010, Sects. 1.1.4, 2.2; Legg and Hutter 2007; Doyle 1992), this normative disagreement has some bearing on how to build these machines (cf. Soares and Fallenstein 2014a, Sect. 2.2; Soares and Fallenstein 2014b, Sect. 1; Bostrom 2014b, Chap. 13, Sect. “Decision theory”). The differences between these decision theories are probably inconsequential in most situations (Ahmed 2014, Sect. 0.5, Chap. 4; Briggs 2017),Footnote 1 but still matter in some (Ahmed 2014, Chaps. 4–6; Soares 2014a; Bostrom 2014a). In fact, AI may expose these differences more often. For example, Newcomb’s problem and the prisoner’s dilemma with a replica (Kuhn 2017, Sect. 7) are easy to implement for agents with copyable source code (cf. Yudkowsky 2010, pp. 85ff.; Soares and Fallenstein 2014b, Sect. 2; Soares 2014b; Cavalcanti 2010, Sect. 5). Indeed, the existence of many copies is the norm for (successful) software, including AI-based software. While copies of present-day software systems may only interact with each other in rigid, explicitly pre-programmed ways, future AI-based systems will make decisions in a more autonomous, flexible and goal-driven way. Overall, the decision theory of Newcomb-like scenarios is a central foundational issue which will plausibly become practically important in the longer term.

The problem for AI research posed by the disagreement among decision theorists can be divided into two questions:

  1. What decision theory do we want an AI to follow?

  2. How could we implement such a decision theory in an AI? Or: How do decision theories and AI frameworks or architectures map onto each other?

Although it certainly requires further discussion, there already is a large literature related to the first question.Footnote 2 In this paper, I would thus like to draw attention to the second question.

Specifically, I would like to investigate how approval-directed agents behave in Newcomb-like problems. By an approval-directed agent, I mean an agent that is coupled with an overseer. After the agent has chosen an action, the overseer scores the agent for that action. Rather than, say, trying to bring about particular states in the environment, the agent chooses actions so as to maximize the score it receives from the overseer (cf. Christiano 2014). A model of approval-directed agency that allows us to describe Newcomb-like situations is described and discussed in Sect. 2.
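
To fix intuitions, the following toy sketch shows the basic control flow of approval-directed agency: the agent simply picks whichever action it expects the overseer to score most highly. The sketch is not taken from Christiano (2014) or from the model of Sect. 2; all names and numbers are illustrative.

```python
# Minimal illustrative sketch of approval-directed agency (assumptions:
# finitely many actions and a table of expected overseer scores).

def approval_directed_choice(actions, expected_score):
    """Return the action the agent expects the overseer to score highest."""
    return max(actions, key=expected_score)

# Hypothetical example: scores the agent expects for two candidate actions.
expected_scores = {"action_1": 0.3, "action_2": 0.7}
print(approval_directed_choice(expected_scores.keys(), expected_scores.get))
# -> action_2
```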

Approval-directed agency is intended as a model of reinforcement learning agents (see Sutton and Barto 1998; Russell and Norvig 2010, Chaps. 17, 21, for introductions to reinforcement learning), for whom the reward function is analogous to the approval-directed agent’s overseer. Since reinforcement learning is such a general and commonly studied problem in artificial intelligence (Hutter 2005, e.g. Chap. 4.1.3; Russell and Norvig 2010, p. 831; Sutton and Barto 1998, Chap. 1), it is an especially attractive target for modeling.Footnote 3 However, because decision theories are usually defined only for single decisions, we will only discuss single decisions, whereas reinforcement learning is usually concerned with sequential interactions between agent and environment. That said, the single decision can also be a policy choice, which allows sequential decision problems to be modeled.Footnote 4 In addition to limiting our analysis to single decisions, we will not discuss the learning process and simply assume that the agent has already formed some model of the world.

If we assume that, after an action has been taken, the overseer rewards the agent based on the expected value of some von Neumann–Morgenstern utility function, the agent is implicitly driven by two decision theories: The overseer can use the regular conditional expectation or the causal expectation to estimate the value of its utility function; and the agent itself can follow CDT or EDT when maximizing the score it receives from the overseer (Sect. 3).

Fig. 1 A causal model of an approval-directed agent in a Newcomb-like decision problem. A denotes the agent’s action, H the environment history, O the observation on which the overseer bases the reward, R is that reward, and \(E_r\) is information about the way the reward is computed that is only available to the overseer. The box is used to indicate that H includes the two random variables \(H_p\) and \(H_f\). All of H may have a causal influence on O

We then show how the overall decision theory depends on these two potentially conflicting decision theories. If the overseer bases its expected value calculations on looking only at the world, then the agent’s decision theory is decisive. If the overseer bases its estimates only on the agent’s action, then the overseer’s decision (or perhaps rather action evaluation) theory is decisive.

2 Approval-directed agency

We first describe a model of approval-directed agency. To be able to apply both CDT and EDT, we will use causal models in Pearl’s (2009) sense. Consequently, we use Pearl’s do-calculus-based version of CDT (Pearl 2009, Chap. 4). We will, throughout this paper, assume that the agent has already formed a (potentially implicit) model of the worldFootnote 5—e.g., based on past interactions with the environment. Also, we will only consider single decisions rather than sequential problems of iterative interaction between agent and environment.

A causal model of such a one-shot Newcomb problem from the perspective of the approval-directed agent is given in Fig. 1. In this model, the agent decides to take some action A, which may causally affect some part of the environment history, i.e., the history of states, H. We will call that part of the history the agent’s causal future \(H_f\). Furthermore, the agent may be causally influenced by some other part of the environment history, which we will call the agent’s causal past \(H_p\). H may contain information other than \(H_f\) and \(H_p\), which we will assume to be independent of A.Footnote 6 The overseer, realized by, e.g., some module physically attached to the agent or a human supervisor, observes the agent’s action and, via some percept O, partially observes the state of the world.Footnote 7 The overseer then calculates the reward R. To set proper incentives for the agent, we will assume the overseer to know not only the action and observation, but also everything that the agent knows (cf. Christiano 2016). The overseer may also have access to some additional piece of information \(E_r\) about the way the reward is to be calculated.Footnote 8 Lastly, we assume that the sets of possible values of A, O and \(E_r\) are finite.
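
As a reading aid, the causal structure just described (and depicted in Fig. 1) can be written out as a parent dictionary. This is purely an illustrative rendering, and it omits the parts of H other than \(H_p\) and \(H_f\).

```python
# The causal graph of Fig. 1 as a parent dictionary (illustrative only;
# the parts of H other than H_p and H_f are omitted here).
parents = {
    "H_p": [],                  # the agent's causal past
    "A":   ["H_p"],             # the action; may be influenced by the past
    "H_f": ["H_p", "A"],        # the agent's causal future
    "O":   ["H_p", "H_f"],      # the overseer's (partial) observation of H
    "E_r": [],                  # information available only to the overseer
    "R":   ["A", "O", "E_r"],   # the reward computed by the overseer
}
```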

In principle, the overseer could reward the agent in all kinds of ways. E.g., it could reward the agent “deontologically” (Alexander and Moore 2016) for taking a particular action independently of the consequences of taking that action. In this paper, we will assume that the reward estimates the value of some von Neumann–Morgenstern utility function U that only depends on states of the world. I use the capital U to indicate that the utility function, too, is a random variable (in the Bayesian sense). For simplicity’s sake, we will, again, assume that the set of possible values of U is finite.

We will view U as representing the system designer’s preferences over world states.Footnote 9 While other ways of assigning the reward are possible, this is certainly an attractive way of getting an approval-directed agent to achieve goals that we want it to achieve. After all, in real-world applications, we will usually care about the outcomes of the agent’s decisions, such as whether a car has reached its destination in time or whether a human has been hurt.

The standard way of estimating U(H) (or any quantity for that matter) is the familiar conditional expectation. Thus, the overseer may compute the reward as

$$\begin{aligned} r=\mathbb {E}\left[ U(H) \mid e_r, a, o \right] , \end{aligned}$$
(1)

where r, a, \(e_r\), and o are values of R, A, \(E_r\), and O, respectively.Footnote 10

A causal decision theorist overseer agrees that, after an action a has been taken, the right-hand side of Eq. 1 most accurately estimates how much utility is achieved. She merely thinks that this term should not be used to decide which action a to take in the first place.Footnote 11 However, this puts a causal decision theorist overseer in a peculiar situation. Whatever formula she uses to compute the reward will also be used by the reward-maximizing agent to decide which action to take. A causal decision theorist overseer might therefore worry (rightfully, as we will see) that providing rewards according to Eq. 1 will make the agent EDT-ish. Hence, she either has to incorrectly estimate how much utility was achieved, or live with the agent using an—in her mind—incorrect way of weighing its options. If she prefers the latter, she would reward according to Eq. 1. But arguably getting the agent to choose correctly is the overseer’s primary goal. Thus, she might prefer to compute the reward according to

$$\begin{aligned} r=\mathbb {E}\left[ U(H) \mid e_r, do(a), o \right] . \end{aligned}$$
(2)

Here, do(a) refers to Pearl’s do-calculus, where conditioning on do(a) roughly means intervening from outside the causal model to set A to a. For an introduction to the do-calculus, see Pearl (2009). Although a causal decision theorist overseer may prefer computing rewards according to Eq. 1, we will from now on say “the overseer uses CDT” if rewards are computed according to Eq. 2 and “the overseer uses EDT” if rewards are calculated according to Eq. 1.
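
To see concretely how the two kinds of conditioning come apart in the graph of Fig. 1, note that, by a standard property of the do-operator, an intervention on A leaves beliefs about the agent’s causal past untouched, whereas ordinary conditioning on a need not:

$$\begin{aligned} P(h_p \mid do(a)) = P(h_p), \quad \text {whereas in general}\quad P(h_p \mid a) \ne P(h_p). \end{aligned}$$

In Newcomb-like problems, it is precisely this difference, i.e., whether the action is treated as evidence about \(H_p\) (for example, about a prediction that has already been made), that separates Eqs. 1 and 2.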

An approval-directed agent is characterized by maximizing the reward it receives from the overseer.Footnote 12 However, decision theory offers us, again, (at least) two different expected values, the regular expected value of EDT

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] , \end{aligned}$$
(3)

and CDT’s causal expected value

$$\begin{aligned} \mathbb {E}\left[ R \mid do(a) \right] . \end{aligned}$$
(4)

We leave the interesting question of which (if any) decision theory describes the behavior of current reinforcement learning algorithms to future researchFootnote 13 and in the following assume that the agent is known to implement either CDT or EDT.

3 The conflict of the decision theories of agent and overseer

When viewed together with the overseer, our agent may now be seen as containing two decision theories: one for computing the reward and one in the algorithm that tries to find the action that maximizes that reward. These decision theories may not always be the same. Given this potential discrepancy, the question is which of the two decision theories prevails, i.e., for which configurations of the two decision theories the overall agent acts like a CDT agent and for which it acts like an EDT agent w.r.t. U.

As it turns out, the answer to this question depends on the decision problem in question. In particular, it depends on whether the overseer updates its estimate of U(H) primarily based on the action taken by the agent or on its observation of the environment.

For illustration, consider two versions of Newcomb’s problem. In both versions, the predictor is equally reliable—e.g., correct with 90% probability—and the potential box contents are the same—e.g., the standard $1K and $1M. As usual, the content of the opaque box cannot be causally influenced by one’s decision. In the first version, the overseer eventually sees the payoff, i.e., how much money the agent has made. In this case, as soon as the money is observed, the overseer’s estimate of U(H) becomes independent of the agent’s action. More generally, O may tell the overseer so much about U(H) that it becomes independent of A even if U(H) is not yet fully observed. That is,

$$\begin{aligned} \mathbb {E}\left[ U(H) \mid e_r, a, o \right] = \mathbb {E}\left[ U(H) \mid e_r, o \right] \end{aligned}$$
(5)

and

$$\begin{aligned} \mathbb {E}\left[ U(H) \mid e_r, do(a), o \right] = \mathbb {E}\left[ U(H) \mid e_r, o \right] \end{aligned}$$
(6)

for all \(e_r\), a and o. Note that neither of these two implies the other.Footnote 14 Intuitively speaking, these two mean that the reward is ultimately determined by U(H).
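
As a quick illustration of the first version (identifying utility with the monetary payoff for simplicity), suppose the overseer directly observes the total amount of money o that the agent ended up with. Then

$$\begin{aligned} \mathbb {E}\left[ U(H) \mid e_r, a, o \right] = \mathbb {E}\left[ U(H) \mid e_r, do(a), o \right] = o, \end{aligned}$$

so the observation screens off the action from the overseer’s estimate of U(H).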

In the second version of Newcomb’s problem, the monetary payoff is not observed by the overseer but is covertly invested in a way that increases U. Only the agent’s choice can then inform the overseer about U(H). Formally, we have both

$$\begin{aligned} \mathbb {E}\left[ U(H) \mid e_r, a, o \right] = \mathbb {E}\left[ U(H) \mid e_r, a \right] \end{aligned}$$
(7)

and

$$\begin{aligned} \mathbb {E}\left[ U(H) \mid e_r, do(a), o \right] = \mathbb {E}\left[ U(H) \mid e_r, do(a) \right] . \end{aligned}$$
(8)

Intuitively speaking, these two equations mean that the reward is not determined by U(H) but by what the overseer believes U(H) will be given a or do(a).

Again, we assume that this is known to the agent. An example class of cases is that in which the agent’s decisions are correlated with those of agents in far-away parts of the environment (cf. Treutlein and Oesterheld 2017; Oesterheld 2018b). The two versions are depicted in Fig. 2.
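
For concreteness, consider the second version with the illustrative numbers from above (a predictor that is correct with 90% probability, payoffs of $1K and $1M, and utility identified with money), and let p denote the overseer’s prior probability that the opaque box contains the $1M. Since intervening on A does not change the overseer’s beliefs about the prediction, the two reward rules (suppressing \(e_r\) for readability) evaluate the two actions as follows:

$$\begin{aligned} \mathbb {E}\left[ U(H) \mid \text {one-box} \right]&= 0.9 \cdot 1{,}000{,}000 = 900{,}000, \\ \mathbb {E}\left[ U(H) \mid \text {two-box} \right]&= 0.1 \cdot 1{,}000{,}000 + 1{,}000 = 101{,}000, \\ \mathbb {E}\left[ U(H) \mid do(\text {one-box}) \right]&= p \cdot 1{,}000{,}000, \\ \mathbb {E}\left[ U(H) \mid do(\text {two-box}) \right]&= p \cdot 1{,}000{,}000 + 1{,}000. \end{aligned}$$

An EDT overseer thus rewards one-boxing more highly, while a CDT overseer rewards two-boxing more highly, whatever p is. This is why, as the derivations below show, the overseer’s decision theory is decisive in problems of this type.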

Fig. 2 Two different ways in which the overseer can calculate the reward

Of course, these are only the two extremes from the set of all possible situations. In real-world Newcomb-like scenarios, the overseer may also draw some information from both sources. Nonetheless, it seems useful to understand the extreme cases, as this may also help us understand mixed ones.

In the following subsections, we will show that in the first type, the decision theory of the agent is decisive, whereas in the second type, the overseer’s decision theory is.Footnote 15 Roughly, the reason for that is the following: As noted earlier, the reward in the first type depends directly on U(H). Thus, the agent will try to maximize U(H) according to its own decision theory. In the second type, the overseer takes the agent’s action a and then considers what either a or do(a) says about U(H). Thus, the agent has to pay careful attention to whether the overseer uses EDT’s or CDT’s expected value.

We prove this formally by considering all possible configurations of the type of the problem, the overseer’s decision theory and the agent’s decision theory. While we will limit our analysis to EDT and CDT, the results can easily be generalized to variants of these that arise from modifying the causal model or conditional credence distribution (e.g. Yudkowsky 2010; “Disposition-based decision theory”; Spohn 2012; Dohrn 2015). The analysis is summarized in Table 1.

Table 1 An overview of the results of the calculations in Sect. 3

3.1 First type

3.1.1 The EDT agent

The EDT agent judges its action by

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] . \end{aligned}$$
(9)

If the overseer calculates the regular conditional expectation, we have

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, O, a \right] \mid a \right] \end{aligned}$$
(10)
$$\begin{aligned} = \mathbb {E}\left[ U(H) \mid a \right] , \end{aligned}$$
(11)

where the last line is due to what is sometimes called the law of total expectation (LTE) or the tower rule (see, e.g., Ross 2007, Sect. 3.4; Billingsley 1995, Theorem 34.4). Intuitively, you cannot expect that gaining more evidence (i.e., \(E_r\) and O in addition to a) moves your expectation of U(H) into any particular direction.
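
In the form in which it is used here, LTE states that for any random variables X and Y and any event a of positive probability,

$$\begin{aligned} \mathbb {E}\left[ \mathbb {E}\left[ X \mid Y, a \right] \mid a \right] = \mathbb {E}\left[ X \mid a \right] . \end{aligned}$$

In Eq. 11, X is U(H), Y is the pair \((E_r, O)\), and the conditioning event is the agent’s action a.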

Because the overseer knows more than the agent, we will need this rule in all of the following derivations. Its application makes it hard to generalize these results to other decision theories, since LTE does not apply if the two decision theories do not both compute a form of expected utility.

Equations 10 and 11 show that if the overseer computes regular expected value and the agent maximizes the reward according to EDT, then the agent as a whole maximizes U according to EDT.

If the overseer computes CDT’s expected value, we have

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, do(a), O \right] \mid a \right] \end{aligned}$$
(12)
$$\begin{aligned} = \sum _{e_r,o} P(e_r,o\mid a) \cdot \mathbb {E}\left[ U(H) \mid e_r, do(a), o \right] \end{aligned}$$
(13)
$$\begin{aligned} \underset{\text {Eqs. 5 and 6}}{=} \sum _{e_r,o} P(e_r,o\mid a) \cdot \mathbb {E}\left[ U(H) \mid e_r, a, o \right] \end{aligned}$$
(14)
$$\begin{aligned} = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, a, O \right] \mid a \right] \end{aligned}$$
(15)
$$\begin{aligned} \underset{\text {LTE}}{=} \mathbb {E}\left[ U(H) \mid a \right] . \end{aligned}$$
(16)

Thus, in the first type of problem, the EDT agent maximizes \(\mathbb {E}\left[ U(H) \mid a \right] \), i.e., it acts like an EDT agent w.r.t. U, regardless of the overseer’s decision theory.

3.1.2 The CDT agent

The CDT agent judges its action by

$$\begin{aligned} \mathbb {E}\left[ R \mid do(a) \right] . \end{aligned}$$
(17)

If the overseer uses regular expected value (EDT), then

$$\begin{aligned} \mathbb {E}\left[ R \mid do(a) \right] = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid a, O, E_r \right] \mid do(a) \right] \end{aligned}$$
(18)
$$\begin{aligned} = \sum _{e_r,o} P(e_r,o\mid do(a)) \cdot \mathbb {E}\left[ U(H) \mid a, o, e_r \right] \end{aligned}$$
(19)
$$\begin{aligned} \underset{\text {Eqs. 5 and 6}}{=} \sum _{e_r,o} P(e_r,o\mid do(a)) \cdot \mathbb {E}\left[ U(H) \mid do(a), o, e_r \right] \end{aligned}$$
(20)
$$\begin{aligned} = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid do(a), O, E_r \right] \mid do(a) \right] \end{aligned}$$
(21)
$$\begin{aligned} \underset{\text {LTE}}{=} \mathbb {E}\left[ U(H) \mid do(a) \right] \end{aligned}$$
(22)

Learning about an intervention do(a) cannot always be treated in the same way as learning about other events. Hence, the application of the law of total expectation is not straightforward. However, \(P(\cdot \mid do(x))\) is always a probability distribution. Because the law of total expectation applies to all probability distributions, it also applies to ones resulting from the application of the do-calculus.

If the overseer uses CDT’s expected value, then

$$\begin{aligned} \mathbb {E}\left[ R \mid do(a) \right] = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, O, do(a) \right] \mid do(a) \right] \end{aligned}$$
(23)
$$\begin{aligned} \underset{\text {LTE}}{=} \mathbb {E}\left[ U(H) \mid do(a) \right] . \end{aligned}$$
(24)

In both cases, the CDT agent thus maximizes \(\mathbb {E}\left[ U(H) \mid do(a) \right] \): in the first type of problem, it acts like a CDT agent w.r.t. U, regardless of the overseer’s decision theory.

3.2 Second type

3.2.1 The EDT agent

The EDT agent judges its actions by

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] . \end{aligned}$$
(25)

If the overseer is based on regular conditional expectation (EDT), then we again have

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, a \right] \mid a \right] \end{aligned}$$
(26)
$$\begin{aligned} \underset{\text {LTE}}{=} \mathbb {E}\left[ U(H) \mid a \right] . \end{aligned}$$
(27)

If the overseer is based on CDT-type expectation, then

$$\begin{aligned} \mathbb {E}\left[ R \mid a \right] = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, do(a) \right] \mid a \right] \end{aligned}$$
(28)
$$\begin{aligned} = \sum _{e_r} P(e_r\mid a) \cdot \mathbb {E}\left[ U(H) \mid do(a), e_r \right] \end{aligned}$$
(29)
$$\begin{aligned} = \sum _{e_r} P(e_r) \cdot \mathbb {E}\left[ U(H) \mid do(a), e_r \right] \end{aligned}$$
(30)
$$\begin{aligned} = \sum _{e_r} P(e_r\mid do(a)) \cdot \mathbb {E}\left[ U(H) \mid do(a), e_r \right] \end{aligned}$$
(31)
$$\begin{aligned} = \mathbb {E}\left[ \mathbb {E}\left[ U(H) \mid E_r, do(a) \right] \mid do(a) \right] \end{aligned}$$
(32)
$$\begin{aligned} \underset{\text {LTE}}{=} \mathbb {E}\left[ U(H) \mid do(a) \right] . \end{aligned}$$
(33)

Here, Eq. 30 uses that \(E_r\) is independent of A, and Eq. 31 that \(E_r\) is unaffected by an intervention on A; both follow from the causal model of Fig. 1. Thus, in the second type of problem, the EDT agent maximizes whichever expectation the overseer uses: it acts like an EDT agent w.r.t. U if the overseer uses EDT, and like a CDT agent w.r.t. U if the overseer uses CDT.

3.2.2 The CDT agent

The CDT agent judges actions by

$$\begin{aligned} \mathbb {E}\left[ R \mid do(a) \right] . \end{aligned}$$
(34)

Because of Rule 2 in Theorem 3.4.1 of Pearl (2009, Sect. 3.4.2) applied to the causal graph of Fig. 2b, we have

$$\begin{aligned} \mathbb {E}\left[ R \mid do(a) \right] = \mathbb {E}\left[ R \mid a \right] . \end{aligned}$$
(35)

Thus, the analysis of the CDT agent is equivalent to that of the EDT agent.
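
As an informal sanity check of these results (not part of the formal argument), the following short script enumerates a toy Newcomb problem with the illustrative numbers used above and computes, for every combination of problem type, agent decision theory and overseer decision theory, which action the approval-directed agent ends up taking. All modeling choices in the script (a uniform prior over the box content, utility identified with money) are assumptions made only for the purpose of the illustration.

```python
# Brute-force check of the results summarized in Table 1 on a toy Newcomb
# problem (illustrative only). The predictor is correct with probability 0.9;
# the opaque box contains $1M or nothing; the transparent box contains $1K;
# U is the money obtained. In the first type of problem the overseer observes
# the payoff; in the second type its observation is uninformative.

from itertools import product

BOX_PRIOR = {"full": 0.5, "empty": 0.5}   # assumed prior over the box content
ACTIONS = ["one-box", "two-box"]
ACCURACY = 0.9                            # P(action matches the prediction)


def p_action_given_box(a, box):
    predicted_one_box = (box == "full")
    return ACCURACY if (a == "one-box") == predicted_one_box else 1 - ACCURACY


def utility(box, a):
    money = 1_000_000 if box == "full" else 0
    return money + (1_000 if a == "two-box" else 0)


def observation(box, a, first_type):
    # First type: the overseer sees the payoff; second type: it sees nothing.
    return utility(box, a) if first_type else "nothing"


def reward(a, o, overseer, first_type):
    """Overseer's reward: E[U | a, o] (EDT) or E[U | do(a), o] (CDT)."""
    num = den = 0.0
    for box, p_box in BOX_PRIOR.items():
        if observation(box, a, first_type) != o:
            continue  # box content inconsistent with the observation
        # An EDT overseer also treats the action as evidence about the box;
        # under do(a) the action carries no such evidence.
        w = p_box * (p_action_given_box(a, box) if overseer == "EDT" else 1.0)
        num += w * utility(box, a)
        den += w
    return num / den


def agent_value(a, agent, overseer, first_type):
    """Agent's score for a: E[R | a] (EDT agent) or E[R | do(a)] (CDT agent)."""
    weights = {box: p_box * (p_action_given_box(a, box) if agent == "EDT" else 1.0)
               for box, p_box in BOX_PRIOR.items()}
    total = sum(w * reward(a, observation(box, a, first_type), overseer, first_type)
                for box, w in weights.items())
    return total / sum(weights.values())


for first_type, agent, overseer in product([True, False], ["EDT", "CDT"], ["EDT", "CDT"]):
    choice = max(ACTIONS, key=lambda a: agent_value(a, agent, overseer, first_type))
    label = "first type " if first_type else "second type"
    print(f"{label} | {agent} agent | {overseer} overseer -> {choice}")
```

In the first type of problem, the printed choice tracks the agent’s decision theory (the agent one-boxes exactly if it uses EDT); in the second type, it tracks the overseer’s.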

4 Conclusion

In this paper, we have taken a step toward mapping reinforcement learning architectures onto decision theories. We found that in Newcomb-like problems, if the overseer rewards the agent purely on the basis of the agent’s action, then the overall system’s behavior is determined by the decision theory implicit in the overseer’s reward function. If the overseer judges the agent based on looking at the world, however, then the agent’s decision theory is decisive.

This has implications for how we should design approval-directed agents. For instance, if we would like to leave decision-theoretical judgements to the overseer, we must ensure that the overseer assigns rewards before making new observations about the world state (cf. Christiano 2014, Sect. “Avoid lock-in”). Of course, this makes the reward less accurate and may thus slow down the agent’s learning process. If we want the overseer to look at both the world and the agent’s action, then we need to align both the overseer’s and the agent’s decision theory.

Much more research is left to be done at the intersection of decision theory and artificial intelligence. For instance, what (if any) decision theories describe the way modern reinforcement learning algorithms maximize reward? Do the results of this paper generalize to sequential decision problems? Moving away from the reinforcement learning framework, what decision theories do other frameworks in AI implement? What about decision theories other than CDT and EDT?