Approval-directed agency and the decision theory of Newcomb-like problems

Decision theorists disagree about how instrumentally rational agents, i.e., agents trying to achieve some goal, should behave in so-called Newcomb-like problems, with the main contenders being causal and evidential decision theory. Since the main goal of artificial intelligence research is to create machines that make instrumentally rational decisions, the disagreement pertains to this field. In addition to the more philosophical question of what the right decision theory is, the goal of AI poses the question of how to implement any given decision theory in an AI. For example, how would one go about building an AI whose behavior matches evidential decision theory’s recommendations? Conversely, we can ask which decision theories (if any) describe the behavior of any existing AI design. In this paper, we study what decision theory an approval-directed agent, i.e., an agent whose goal it is to maximize the score it receives from an overseer, implements. If we assume that the overseer rewards the agent based on the expected value of some von Neumann–Morgenstern utility function, then such an approval-directed agent is guided by two decision theories: the one used by the agent to decide which action to choose in order to maximize the reward and the one used by the overseer to compute the expected utility of a chosen action. We show which of these two decision theories describes the agent’s behavior in which situations.


Introduction
In decision theory, there is a large debate about how an instrumentally rational agent, i.e., an agent trying to achieve some goal or maximize some utility function, should decide in Newcomb's problem (introduced by Nozick 1969) and variations thereof (a list is given by Ledwig 2000, pp. 80-87). Consequently, different normative theories of instrumental rationality have been developed. The best known ones are evidential (sometimes also called Bayesian) decision theory (EDT) (Ahmed 2014; Almond 2010; Price 1986; Horgan 1981) and causal decision theory (CDT) (Gibbard and Harper 1981; Joyce 1999; Lewis 1981; Skyrms 1982; Weirich 2016), but many have attempted to remediate what they view as failures of the two theories by proposing further alternatives (Spohn 2003, 2012; Poellinger 2013; Arntzenius 2008; Gustafsson 2011; Wedgwood 2013; Dohrn 2015; Price 2012; Soares and Levinstein 2017).
Because the main goal of artificial intelligence is to build machines that make instrumentally rational decisions (Russell and Norvig 2010, Sects. 1.1.4, 2.2; Legg and Hutter 2007; Doyle 1992), this normative disagreement has some bearing on how to build these machines (cf. Soares and Fallenstein 2014a, Sect. 2.2; Soares and Fallenstein 2014b, Sect. 1; Bostrom 2014b, Chap. 13, Sect. "Decision theory"). The differences between these decision theories are probably inconsequential in most situations (Ahmed 2014, Sect. 0.5, Chap. 4; Briggs 2017),[1] but still matter in some (Ahmed 2014, Chap. 4-6; Soares 2014a; Bostrom 2014a). In fact, AI may expose the differences more often. For example, Newcomb's problem and the prisoner's dilemma with a replica (Kuhn 2017, Sect. 7) are easy to implement for agents with copyable source code (cf. Yudkowsky 2010, pp. 85ff.; Soares and Fallenstein 2014b, Sect. 2; Soares 2014b; Cavalcanti 2010, Sect. 5). Indeed, the existence of many copies is the norm for (successful) software, including AI-based software. While copies of present-day software systems may only interact with each other in rigid, explicitly pre-programmed ways, future AI-based systems will make decisions in a more autonomous, flexible and goal-driven way. Overall, the decision theory of Newcomb-like scenarios is a central foundational issue which will plausibly become practically important in the longer term.
The problem for AI research posed by the disagreement among decision theorists can be divided into two questions:
1. What decision theory do we want an AI to follow?
2. How could we implement such a decision theory in an AI? Or: How do decision theories and AI frameworks or architectures map onto each other?
Although it certainly requires further discussion, there already is a large literature related to the first question.[2] In this paper, I would thus like to draw attention to the second question.
Specifically, I would like to investigate how approval-directed agents behave in Newcomb-like problems. By an approval-directed agent, I mean an agent that is coupled with an overseer. After the agent has chosen an action, the overseer scores the agent for that action. Rather than, say, trying to bring about particular states in the environment, the agent chooses actions so as to maximize the score it receives from the overseer (cf. Christiano 2014). A model of approval-directed agency that allows us to describe Newcomb-like situations is described and discussed in Sect. 2.
Approval-directed agency is intended as a model of reinforcement learning agents (see Sutton and Barto 1998; Russell and Norvig 2010, Chaps. 17, 21, for introductions to reinforcement learning), for whom the reward function is analogous to the approval-directed agent's overseer. Since reinforcement learning is such a general and commonly studied problem in artificial intelligence (Hutter 2005, e.g. Chap. 4.1.3; Russell and Norvig 2010, p. 831; Sutton and Barto 1998, Chap. 1), it is an especially attractive target for modeling.[3] However, because decision theories are usually defined only for single decisions, we will only discuss single decisions, whereas reinforcement learning is usually concerned with sequential interactions of agent and environment. That said, this single decision can also be a policy choice, which allows us to model sequential decision problems.[4] In addition to limiting our analysis to single decisions, we will not discuss the learning process and simply assume that the agent has already formed some model of the world.
If we assume that, after an action has been taken, the overseer rewards the agent based on the expected value of some von Neumann-Morgenstern utility function, the agent is implicitly driven by two decision theories: The overseer can use the regular conditional expectation or the causal expectation to estimate the value of its utility function; and the agent itself can follow CDT or EDT when maximizing the score it receives from the overseer (Sect. 3).
We then show how the overall decision theory depends on these two potentially conflicting decision theories. If the overseer bases its expected value calculations on looking only at the world, then the agent's decision theory is decisive. If the overseer bases its estimates only on the agent's action, then the overseer's decision (or perhaps rather action evaluation) theory is decisive.

Footnote 2 (continued): (Meacham 2010; Soares and Fallenstein 2014b, Sect. 3; Yudkowsky 2010, Sect. 2; Greene 2018). The same arguments imply that even if one is convinced of CDT or EDT one would not want the AI to use CDT and EDT. That said, one could also leave the self-modification to the AI.

Footnote 3: Reinforcement learning and approval-directed agency are also common outside of artificial intelligence. For example, Achen and Bartels (2016, Chap. 4) review evidence which shows that electorates often vote retrospectively to punish or reward incumbents.

Footnote 4: This is consistent with what reinforcement learning algorithms usually do: they choose policies rather than individual actions. This is because the utility of a single action usually cannot be evaluated without knowing how the agent will deal with situations that might arise as a result of taking that action. When individual actions can be evaluated in isolation, the ex ante policy choice sometimes differs from the choice of individual actions (see the absent-minded driver, introduced by Piccione and Rubinstein 1997; cf. Aumann et al. 1997; the Newcomb-like scenarios discussed by, e.g., Hintze 2014; Soares and Levinstein 2017, Sect. 2; and the problems in anthropics discussed by Armstrong 2011). While it is rarely discussed in the debate between evidential and causal decision theorists, a few authors regard this discrepancy as crucial and have argued that a proper decision theory should be about optimal policy choices (e.g. Hintze 2014; Soares and Fallenstein 2014b, Sect. 2.1; Soares and Levinstein 2017, Sect. 2). However, this issue is beyond the scope of the present paper. Further issues in sequential Newcomb-like problems are discussed by Everitt et al. (2015).

Fig. 1 A causal model of an approval-directed agent in a Newcomb-like decision problem. A denotes the agent's action, H the environment history, O the observation on which the overseer bases the reward, R is that reward, and E_r is information about the way the reward is computed that is only available to the overseer. The box is used to indicate that H includes the two random variables H_p and H_f. All of H may have a causal influence on O.

Approval-directed agency
We first describe a model of approval-directed agency. To be able to apply both CDT and EDT, we will use causal models in Pearl's (2009) sense. Consequently, we use Pearl's do-calculus-based version of CDT (Pearl 2009, Chap. 4). We will, throughout this paper, assume that the agent has already formed a (potentially implicit) model of the world[5] (e.g., based on past interactions with the environment). Also, we will only consider single decisions rather than sequential problems of iterative interaction between agent and environment.
A causal model of such a one-shot Newcomb problem from the perspective of the approval-directed agent is given in Fig. 1. In this model, the agent decides to take some action A, which may causally affect some part of the environment history, i.e., the history of states, H. We will call that part of the history the agent's causal future H_f. Furthermore, the agent may be causally influenced by some other part of the environment history, which we will call the agent's causal past H_p. H may contain information other than H_f and H_p, which we will assume to be independent of A.[6] The overseer, physically realized by, e.g., some module physically attached to the agent or a human supervisor, observes the agent's action and partially, via some percept O, the state of the world.[7] The overseer then calculates the reward R. To set proper incentives for the agent, we will assume the overseer to know not only the action and observation, but also everything that the agent knows (cf. Christiano 2016). The overseer may also have access to some additional piece of information E_r about the way the reward is to be calculated.[8] Lastly, we assume that the sets of possible values of A, O and E_r are finite.
In principle, the overseer could reward the agent in all kinds of ways. E.g., it could reward the agent "deontologically" (Alexander and Moore 2016) for taking a particular action independently of the consequences of taking that action. In this paper, we will assume that the reward estimates the value of some von Neumann-Morgenstern utility function U that only depends on states of the world. I use the capital U to indicate that the utility function, too, is a random variable (in the Bayesian sense). For simplicity's sake, we will, again, assume that the set of possible values of U is finite.
We will view U as representing the system designer's preferences over world states.[9] While other ways of assigning the reward are possible, this is certainly an attractive way of getting an approval-directed agent to achieve goals that we want it to achieve. After all, in real-world applications, we will usually care about the outcomes of the agent's decisions, such as whether a car has reached its destination in time or whether a human has been hurt.
The standard way of estimating U(H) (or any quantity for that matter) is the familiar conditional expectation. Thus, the overseer may compute the reward as

\[ r = \mathbb{E}[U(H) \mid e_r, a, o], \tag{1} \]

where r, a, e_r, and o are values of R, A, E_r, and O, respectively.[10] A causal decision theorist overseer agrees that after an action a is taken the right-hand side of Eq. 1 most accurately estimates how much utility is achieved. She merely thinks that this term should not be used to decide which action a to take in the first place.[11] However, this puts a causal decision theorist overseer in a peculiar situation. Whatever formula she uses to compute the reward will also be used by the reward-maximizing agent to decide which action to take. A causal decision theorist overseer might therefore worry (rightfully, as we will see) that providing rewards according to Eq. 1 will make the agent EDT-ish. Hence, she either has to incorrectly estimate how much utility was achieved, or live with the agent using what is, in her mind, an incorrect way of weighing her options. If she prefers the latter, she would reward according to Eq. 1. But arguably getting the agent to choose correctly is the overseer's primary goal. Thus, she might prefer to compute the reward according to

\[ r = \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o]. \tag{2} \]

Here, do(a) refers to Pearl's do-calculus, where conditioning on do(a) roughly means intervening from outside the causal model to set A to a. For an introduction to the do-calculus, see Pearl (2009). Although a causal decision theorist overseer may prefer computing rewards according to Eq. 1, we will from now on say "the overseer uses CDT" if rewards are computed according to Eq. 2 and "the overseer uses EDT" if rewards are calculated according to Eq. 1.

Footnote 6 (continued): know your common source code), the dependence persists. We exclude these dependences because such situations cannot be modeled by standard causal graphs. However, we could adapt causal graphs to accommodate these kinds of dependences. First, we could modify our definition of causality in such a way that dependence does imply causation, as has been proposed by Spohn (2003, 2012), Yudkowsky (2010) and others. For instance, we could model the dependence between the outputs of two instances of an algorithm by introducing a logical node as a common cause of the two. This logical node would then represent the output of the abstract algorithm that the two copies implement. While changes to the concept of causation may affect CDT's implied behavior, the results from this paper can be directly transferred to such modifications. Alternatively, we could extend causal graphs to also include non-causal dependences (cf. Poellinger 2013). Such an extension necessitates a new CDT formalism, so the proofs from this paper do not directly transfer to this case. That said, I expect our results to generalize given that both EDT and CDT would probably treat non-causal dependences on the action just like they treat causal arrows directed toward the action.

Footnote 7: Christiano (2014) does not define approval-directed agency formally, but judging from a comment he made at https://medium.com/paulfchristiano/i-agree-that-the-key-feature-of-approval-directed-agents-isthat-the-causal-picture-is-736b4474910e, he considers it crucial to his conception that the overseer only looks at the agent's action and does not observe the action's consequences (cf. the distinction introduced in Sect. 3).

Footnote 8: One reason for the overseer to have access to such additional information is that some of the human supervisor's values may not be expressible in a way that the approval-directed agent's algorithm can utilize (cf. Muehlhauser and Helm 2012, Sects. 3, 4, 5.3).

Footnote 9: Some have tried to modify the reward relative to the designer's preferences to make the reinforcement learning problem easier to solve (Sorg 2011), although Sutton and Barto (1998, Sect. 3.2) explicitly discourage such tricks in their reinforcement learning textbook.
An approval-directed agent is characterized by maximizing the reward it receives from the overseer.[12] However, decision theory offers us, again, (at least) two different expected values, the regular expected value of EDT

\[ \mathbb{E}[R \mid a] \tag{3} \]

and CDT's causal expected value

\[ \mathbb{E}[R \mid \mathrm{do}(a)]. \tag{4} \]

We leave the interesting question of which (if any) decision theory describes the behavior of current reinforcement learning algorithms to future research[13] and in the following assume that the agent is known to implement either CDT or EDT.

The conflict of the decision theories of agent and overseer
When viewed together with the overseer, our agent may now be seen as containing two decision theories, one for computing the reward and one in the algorithm that tries to find the action to maximize that reward. These decision theories may not always be the same. Given this potential discrepancy, the question is which of the two decision theories prevails, i.e., for which configurations of the two decision theories the overall agent acts like a CDT agent and for which it acts like an EDT agent w.r.t. U. As it turns out, the answer to this question depends on the decision problem in question. In particular, it depends on whether the overseer updates its estimate of U(H) primarily based on the action taken by the agent or on its observation of the environment.
For illustration, consider two versions of Newcomb's problem. In both versions, the predictor is equally reliable (e.g., correct with 90% probability) and the potential box contents are the same (e.g., the standard $1K and $1M). As usual, the content of the opaque box cannot be causally influenced by one's decision. In the first version, the overseer eventually sees the payoff, i.e., how much money the agent has made. In this case, as soon as the money is observed, the overseer's estimate of U(H) becomes independent of the agent's action. More generally, O may tell the overseer so much about U(H) that it becomes independent of A even if U(H) is not yet fully observed. That is,

\[ \mathbb{E}[U(H) \mid e_r, a, o] = \mathbb{E}[U(H) \mid e_r, o] \tag{5} \]

and

\[ \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o] = \mathbb{E}[U(H) \mid e_r, o] \tag{6} \]

for all e_r, a and o. Note that neither of these two implies the other.[14] Intuitively speaking, these two mean that the reward is ultimately determined by U(H).
Footnote 13: For preliminary work on this question, see Mayer et al. (2016), Oesterheld (2018a) and perhaps Albert and Heiner (2001).

Footnote 14: We give a brief justification of this claim. If all of a's causal influence on H can be discerned from O, then, of course, a could still be diagnostically relevant for one's estimate of U(H). The other direction is more complicated. The idea is that Eq. 5 can be true if the causal and non-causal implications of a exactly cancel each other out. An example is a version of Newcomb's problem in which one-boxing ensures with certainty that both boxes contain the same amount of money. Then if O and E_r do not contain any information, the expected value of two-boxing and one-boxing is the same and so learning of the action is irrelevant for estimating U(H). However, two-boxing is causally better than one-boxing, so Eq. 6 is violated.

In the second version of Newcomb's problem, the monetary payoff is not observed but covertly invested into increasing the agent's utility function. Only the agent's choice can then inform the overseer about U(H). Formally, it is both

\[ \mathbb{E}[U(H) \mid e_r, a, o] = \mathbb{E}[U(H) \mid a] \tag{7} \]

Synthese
and

\[ \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o] = \mathbb{E}[U(H) \mid \mathrm{do}(a)]. \tag{8} \]

Intuitively speaking, these two equations mean that the reward is not determined by U(H) but by what the overseer believes U(H) will be given a or do(a). Again, we assume that this is known to the agent. An example class of cases is that in which the agent's decisions are correlated with those of agents in far-away parts of the environment (cf. Treutlein and Oesterheld 2017; Oesterheld 2018b). The two versions are depicted in Fig. 2.
Of course, these are only the two extremes from the set of all possible situations.In real-world Newcomb-like scenarios, the overseer may also draw some information from both sources.Nonetheless, it seems useful to understand the extreme cases, as this may also help us understand mixed ones.
In the following subsections, we will show that in the first type, the decision theory of the agent is decisive, whereas in the second type, the overseer's decision theory is.[15] Roughly, the reason for that is the following: As noted earlier, the reward in the first type depends directly on U(H). Thus, the agent will try to maximize U(H) according to its own decision theory. In the second type, the overseer takes the agent's action a and then considers what either a or do(a) says about U(H). Thus, the agent has to pay careful attention to whether the overseer uses EDT's or CDT's expected value.
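The rough argument above can be checked numerically on the two versions of Newcomb's problem used for illustration. The following is a minimal sketch, not the paper's formal model: it assumes that E_r carries no information, that O is either the exact payoff (first type) or empty (second type), and that the prior probability of a full opaque box, which CDT needs, is 0.5. That prior and all function names are assumptions introduced here for illustration.

```python
ACCURACY = 0.9                  # predictor reliability, as in the text
SMALL, BIG = 1_000, 1_000_000   # box contents, as in the text
PRIOR_FULL = 0.5                # ASSUMPTION: prior that the opaque box is full
ACTIONS = ("one-box", "two-box")


def p_full(action, theory):
    """Probability that the opaque box is full, as judged by `theory`."""
    if theory == "EDT":  # EDT treats the action as evidence about the prediction
        return ACCURACY if action == "one-box" else 1 - ACCURACY
    return PRIOR_FULL    # CDT: intervening on the action leaves the box untouched


def payoff(action, full):
    """Money received, used here as the utility U(H)."""
    return BIG * full + SMALL * (action == "two-box")


def expected_utility(action, theory):
    """E[U(H) | a] for EDT, E[U(H) | do(a)] for CDT."""
    p = p_full(action, theory)
    return p * payoff(action, True) + (1 - p) * payoff(action, False)


def reward(action, full, overseer, payoff_observed):
    """Reward assigned by the overseer (Eq. 1 for EDT, Eq. 2 for CDT)."""
    if payoff_observed:
        # First type: the observed payoff settles U(H) under either theory.
        return payoff(action, full)
    # Second type: only the action informs the overseer's estimate.
    return expected_utility(action, overseer)


def expected_reward(action, agent, overseer, payoff_observed):
    """E[R | a] for an EDT agent, E[R | do(a)] for a CDT agent."""
    p = p_full(action, agent)
    return (p * reward(action, True, overseer, payoff_observed)
            + (1 - p) * reward(action, False, overseer, payoff_observed))


def choice(agent, overseer, payoff_observed):
    """Action that maximizes the agent's expected reward."""
    return max(ACTIONS,
               key=lambda a: expected_reward(a, agent, overseer, payoff_observed))


if __name__ == "__main__":
    for payoff_observed in (True, False):
        for agent in ("EDT", "CDT"):
            for overseer in ("EDT", "CDT"):
                print(f"payoff observed={payoff_observed}, agent={agent}, "
                      f"overseer={overseer}: {choice(agent, overseer, payoff_observed)}")
```

Under these assumptions, the sketch selects one-boxing whenever the decisive theory is EDT and two-boxing whenever it is CDT: the agent's theory settles the choice when the payoff is observed, and the overseer's theory settles it when only the action is observed.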
We prove this formally by considering all possible configurations of the type of the problem, the overseer's decision theory and the agent's decision theory. While we will limit our analysis to EDT and CDT, the results can easily be generalized to variants of these that arise from modifying the causal model or conditional credence distribution (e.g., Yudkowsky 2010; "Disposition-based decision theory"; Spohn 2012; Dohrn 2015). The analysis is summarized in Table 1.

The EDT agent
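In the first type of problem, the EDT agent judges its action by

\[ \mathbb{E}[R \mid a]. \tag{9} \]

If the overseer calculates regular conditional expectation, then it is

\begin{align*}
\mathbb{E}[R \mid a] &= \sum_{e_r, o} P(e_r, o \mid a) \cdot \mathbb{E}[U(H) \mid e_r, a, o] \tag{10} \\
&= \mathbb{E}[U(H) \mid a], \tag{11}
\end{align*}

where the last line is due to what is sometimes called the law of total expectation (LTE) or the tower rule (see, e.g., Ross 2007, Sect. 3.4; Billingsley 1995, Theorem 34.4). Intuitively, you cannot expect that gaining more evidence (i.e., E_r and O in addition to a) moves your expectation of U(H) in any particular direction.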
Because the overseer knows more than the agent, we will need this rule in all of the following derivations.Its application makes it hard to generalize these results to other decision theories, since LTE does not apply if the two decision theories do not both compute a form of expected utility.
Equations 10 and 11 show that if the overseer computes regular expected value and the agent maximizes the reward according to EDT, then the agent as a whole maximizes U according to EDT.
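The tower-rule step can be checked on a small example. The sketch below is illustrative only: the joint distribution over observations and utilities (for one fixed action a) is made up, and the helper names are hypothetical. It compares the direct expectation E[U(H) | a] with the average, over observations, of the reward an EDT overseer would assign.

```python
from fractions import Fraction as F

# HYPOTHETICAL joint distribution P(o, u | a) for one fixed action a.
joint = {
    ("o1", 0): F(1, 10),
    ("o1", 10): F(3, 10),
    ("o2", 0): F(2, 5),
    ("o2", 10): F(1, 5),
}


def expectation_u(dist):
    """Expected utility under a (possibly unnormalized) distribution over (o, u)."""
    total = sum(dist.values())
    return sum(p * u for (_, u), p in dist.items()) / total


def marginal_o(dist):
    """Marginal P(o | a) of the observation."""
    out = {}
    for (o, _), p in dist.items():
        out[o] = out.get(o, 0) + p
    return out


def restrict_to(dist, o):
    """Entries of the joint distribution that are compatible with observation o."""
    return {key: p for key, p in dist.items() if key[0] == o}


# E[U(H) | a], computed directly.
direct = expectation_u(joint)

# sum_o P(o | a) * E[U(H) | a, o], i.e., the agent's expected reward.
via_reward = sum(p_o * expectation_u(restrict_to(joint, o))
                 for o, p_o in marginal_o(joint).items())

assert direct == via_reward
```

Exact fractions are used so that the two sides can be compared with `==` rather than up to rounding error.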
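If the overseer computes CDT's expected value, it is

\begin{align*}
\mathbb{E}[R \mid a] &= \sum_{e_r, o} P(e_r, o \mid a) \cdot \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o] \tag{13} \\
&\overset{\text{Eqs. 5 and 6}}{=} \sum_{e_r, o} P(e_r, o \mid a) \cdot \mathbb{E}[U(H) \mid e_r, a, o] \\
&\overset{\text{LTE}}{=} \mathbb{E}[U(H) \mid a]. \tag{16}
\end{align*}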

The CDT agent
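In the first type of problem, the CDT agent judges its action by

\[ \mathbb{E}[R \mid \mathrm{do}(a)]. \]

If the overseer uses regular expected value (EDT), then

\begin{align*}
\mathbb{E}[R \mid \mathrm{do}(a)] &= \sum_{e_r, o} P(e_r, o \mid \mathrm{do}(a)) \cdot \mathbb{E}[U(H) \mid e_r, a, o] \\
&\overset{\text{Eqs. 5 and 6}}{=} \sum_{e_r, o} P(e_r, o \mid \mathrm{do}(a)) \cdot \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o] \\
&\overset{\text{LTE}}{=} \mathbb{E}[U(H) \mid \mathrm{do}(a)].
\end{align*}

Learning about an intervention do(a) cannot always be treated in the same way as learning about other events. Hence, the application of the law of total expectation is not straightforward. However, P(· | do(x)) is always a probability distribution. Because the law of total expectation applies to all probability distributions, it also applies to ones resulting from the application of the do-calculus.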
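If the overseer uses CDT's expected value, then

\begin{align*}
\mathbb{E}[R \mid \mathrm{do}(a)] &= \sum_{e_r, o} P(e_r, o \mid \mathrm{do}(a)) \cdot \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o] \\
&\overset{\text{LTE}}{=} \mathbb{E}[U(H) \mid \mathrm{do}(a)]. \tag{24}
\end{align*}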

The EDT agent
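In the second type of problem, the EDT agent judges its actions by

\[ \mathbb{E}[R \mid a]. \]

If the overseer is based on regular conditional expectation (EDT), then it is again

\[ \mathbb{E}[R \mid a] = \sum_{e_r, o} P(e_r, o \mid a) \cdot \mathbb{E}[U(H) \mid e_r, a, o] = \mathbb{E}[U(H) \mid a]. \]

If the overseer is instead based on CDT's expected value, then Eq. 8 yields

\begin{align*}
\mathbb{E}[R \mid a] &= \sum_{e_r, o} P(e_r, o \mid a) \cdot \mathbb{E}[U(H) \mid e_r, \mathrm{do}(a), o] \\
&\overset{\text{Eq. 8}}{=} \sum_{e_r, o} P(e_r, o \mid a) \cdot \mathbb{E}[U(H) \mid \mathrm{do}(a)] \\
&= \mathbb{E}[U(H) \mid \mathrm{do}(a)]. \tag{33}
\end{align*}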

The CDT agent
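In the second type of problem, the CDT agent judges actions by

\[ \mathbb{E}[R \mid \mathrm{do}(a)]. \]

Because of Rule 2 in Theorem 3.4.1 of Pearl (2009, Sect. 3.4.2) applied to the causal graph of Fig. 2b, it is

\[ \mathbb{E}[R \mid \mathrm{do}(a)] = \mathbb{E}[R \mid a]. \]

Thus, the analysis of the CDT agent is equivalent to that of the EDT agent.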
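For reference, Rule 2 of the do-calculus (exchange of action and observation) can be stated as follows. The general rule is Pearl's; the instantiation in the second line is a sketch that takes X and W to be empty, Z = A and Y = R, and it relies on the claim in the text that the independence condition holds in the graph of Fig. 2b.

```latex
% Rule 2 (action/observation exchange). G_{\overline{X}\underline{Z}} is the graph
% obtained from G by deleting all arrows into X and all arrows out of Z.
\[
P(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = P(y \mid \mathrm{do}(x), z, w)
\quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}.
\]
% Instantiation with X = W = \emptyset, Z = A, Y = R (assumed to apply to Fig. 2b):
\[
P(r \mid \mathrm{do}(a)) = P(r \mid a)
\quad \text{if } (R \perp\!\!\!\perp A)_{G_{\underline{A}}},
\qquad \text{hence } \mathbb{E}[R \mid \mathrm{do}(a)] = \mathbb{E}[R \mid a].
\]
```

The condition says that, once the arrows leaving A are removed, no open path connects A and R, i.e., the action is not confounded with the reward.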

Conclusion
In this paper, we have taken a step to map reinforcement learning architectures onto decision theories. We found that in Newcomb-like problems, if the overseer rewards the agent purely on the basis of the agent's action, then the overall system's behavior is determined by the decision theory implicit in the overseer's reward function. If the overseer judges the agent based on looking at the world, however, then the agent's decision theory is decisive. This has implications for how we should design approval-directed agents. For instance, if we would like to leave decision-theoretical judgements to the overseer, we must ensure that the overseer assigns rewards before making new observations about the world state (cf. Christiano 2014, Sect. "Avoid lock-in"). Of course, this makes the reward less accurate and may thus slow down the agent's learning process. If we want the overseer to look at both the world and the agent's action, then we need to align both the overseer's and the agent's decision theory.
Much more research is left to be done at the intersection of decision theory and artificial intelligence. For instance, what (if any) decision theories describe the way modern reinforcement learning algorithms maximize reward? Do the results of this paper generalize to sequential decision problems? Moving away from the reinforcement learning framework, what decision theories do other frameworks in AI implement? What about decision theories other than CDT and EDT?
Fig. 2 Two different ways in which the overseer can calculate the reward. (a) The reward is computed based on ob- (b) The reward is computed based on observing the agent's action.

Table 1 An overview of the results of the calculations in Sect. 3