Approval-directed agency and the decision theory of Newcomb-like problems
Decision theorists disagree about how instrumentally rational agents, i.e., agents trying to achieve some goal, should behave in so-called Newcomb-like problems, with the main contenders being causal and evidential decision theory. Since the main goal of artificial intelligence research is to create machines that make instrumentally rational decisions, the disagreement pertains to this field. In addition to the more philosophical question of what the right decision theory is, the goal of AI poses the question of how to implement any given decision theory in an AI. For example, how would one go about building an AI whose behavior matches evidential decision theory’s recommendations? Conversely, we can ask which decision theories (if any) describe the behavior of any existing AI design. In this paper, we study what decision theory an approval-directed agent, i.e., an agent whose goal it is to maximize the score it receives from an overseer, implements. If we assume that the overseer rewards the agent based on the expected value of some von Neumann–Morgenstern utility function, then such an approval-directed agent is guided by two decision theories: the one used by the agent to decide which action to choose in order to maximize the reward and the one used by the overseer to compute the expected utility of a chosen action. We show which of these two decision theories describes the agent’s behavior in which situations.
Keywords: Reinforcement learning · Causal decision theory · Evidential decision theory · Newcomb’s problem · AI safety · Philosophical foundations of AI
In decision theory, there is a large debate about how an instrumentally rational agent, i.e., an agent trying to achieve some goal or maximize some utility function, should decide in Newcomb’s problem (introduced by Nozick 1969) and variations thereof (a list is given by Ledwig 2000, pp. 80–87). Consequently, different normative theories of instrumental rationality have been developed. The best known ones are evidential (sometimes also called Bayesian) decision theory (EDT) (Ahmed 2014; Almond 2010; Price 1986; Horgan 1981) and causal decision theory (CDT) (Gibbard and Harper 1981; Joyce 1999; Lewis 1981; Skyrms 1982; Weirich 2016), but many have attempted to remediate what they view as failures of the two theories by proposing further alternatives (Spohn 2003, 2012; Poellinger 2013; Arntzenius 2008; Gustafsson 2011; Wedgwood 2013; Dohrn 2015; Price 2012; Soares and Levinstein 2017).
Because the main goal of artificial intelligence is to build machines that make instrumentally rational decisions (Russell and Norvig 2010, Sects. 1.1.4, 2.2; Legg and Hutter 2007; Doyle 1992), this normative disagreement has some bearing on how to build these machines (cf. Soares and Fallenstein 2014a, Sect. 2.2; Soares and Fallenstein 2014b, Sect. 1; Bostrom 2014b, Chap. 13, Sect. “Decision theory”). The differences between these decision theories are probably inconsequential in most situations (Ahmed 2014, Sect. 0.5, Chap. 4; Briggs 2017),1 but still matter in some (Ahmed 2014, Chaps. 4–6; Soares 2014a; Bostrom 2014a). In fact, AI may expose the differences more often. For example, Newcomb’s problem and the prisoner’s dilemma with a replica (Kuhn 2017, Sect. 7) are easy to implement for agents with copyable source code (cf. Yudkowsky 2010, pp. 85ff.; Soares and Fallenstein 2014b, Sect. 2; Soares 2014b; Cavalcanti 2010, Sect. 5). Indeed, the existence of many copies is the norm for (successful) software, including AI-based software. While copies of present-day software systems may only interact with each other in rigid, explicitly pre-programmed ways, future AI-based systems will make decisions in a more autonomous, flexible and goal-driven way. Overall, the decision theory of Newcomb-like scenarios is a central foundational issue which will plausibly become practically important in the longer term.
- What decision theory do we want an AI to follow?
- How could we implement such a decision theory in an AI? Or: how do decision theories and AI frameworks or architectures map onto each other?
Specifically, I would like to investigate how approval-directed agents behave in Newcomb-like problems. By an approval-directed agent, I mean an agent that is coupled with an overseer. After the agent has chosen an action, the overseer scores the agent for that action. Rather than, say, trying to bring about particular states in the environment, the agent chooses actions so as to maximize the score it receives from the overseer (cf. Christiano 2014). A model of approval-directed agency that allows us to describe Newcomb-like situations is described and discussed in Sect. 2.
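As a toy illustration, approval-directed action choice can be sketched as follows. This is a minimal sketch under our own assumptions, not Christiano’s formalism; all names and numbers here are hypothetical.

```python
# Minimal sketch of approval-directed action choice (hypothetical names):
# the agent picks whichever action it anticipates the overseer will score
# highest, rather than directly optimizing any state of the environment.

def choose_action(actions, anticipated_score):
    """Return the action with the highest anticipated overseer score."""
    return max(actions, key=anticipated_score)

# Toy usage: the agent anticipates these overseer scores.
anticipated = {"one-box": 0.9, "two-box": 0.4}
best = choose_action(anticipated, anticipated.get)
print(best)  # one-box
```

The point of the sketch is only that the agent’s objective is the overseer’s score; everything about states of the world enters solely through how the overseer computes that score.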
Approval-directed agency is intended as a model of reinforcement learning agents (see Sutton and Barto 1998; Russell and Norvig 2010, Chaps. 17, 21, for introductions to reinforcement learning), for whom the reward function is analogous to the approval-directed agent’s overseer. Since reinforcement learning is such a general and commonly studied problem in artificial intelligence (Hutter 2005, e.g. Chap. 4.1.3; Russell and Norvig 2010, p. 831; Sutton and Barto 1998, Chap. 1), it is an especially attractive target for modeling.3 However, because decision theories are usually defined only for single decisions, we will discuss only single decisions, whereas reinforcement learning is usually concerned with sequential interactions between agent and environment. That single decision can also be a policy choice, which allows us to model sequential decision problems.4 In addition to limiting our analysis to single decisions, we will not discuss the learning process and simply assume that the agent has already formed some model of the world.
We then show how the overall decision theory depends on these two potentially conflicting decision theories. If the overseer bases its expected value calculations on looking only at the world, then the agent’s decision theory is decisive. If the overseer bases its estimates only on the agent’s action, then the overseer’s decision (or perhaps rather action evaluation) theory is decisive.
2 Approval-directed agency
We first describe a model of approval-directed agency. To be able to apply both CDT and EDT, we will use causal models in Pearl’s (2009) sense. Consequently, we use Pearl’s do-calculus-based version of CDT (Pearl 2009, Chap. 4). We will, throughout this paper, assume that the agent has already formed a (potentially implicit) model of the world5—e.g., based on past interactions with the environment. Also, we will only consider single decisions rather than sequential problems of iterative interaction between agent and environment.
A causal model of such a one-shot Newcomb problem from the perspective of the approval-directed agent is given in Fig. 1. In this model, the agent decides to take some action A, which may causally affect some part of the environment history, i.e., the history of states, H. We will call that part of the history the agent’s causal future \(H_f\). Furthermore, the agent may be causally influenced by some other part of the environment history, which we will call the agent’s causal past \(H_p\). H may contain information other than \(H_f\) and \(H_p\), which we will assume to be independent of A.6 The overseer, physically realized by, e.g., some module attached to the agent or a human supervisor, observes the agent’s action and, partially, via some percept O, the state of the world7. The overseer then calculates the reward R. To set proper incentives for the agent, we will assume the overseer to know not only the action and observation, but also everything that the agent knows (cf. Christiano 2016). The overseer may also have access to some additional piece of information \(E_r\) about the way the reward is to be calculated.8 Lastly, we assume that the sets of possible values of A, O and \(E_r\) are finite.
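To make the structure concrete, the causal model of Fig. 1 can be encoded as a parent map, i.e., a mapping from each node to its direct causes. The node names follow the text; the particular edge set below (e.g., that O depends on both \(H_p\) and \(H_f\)) is our own reading of the model, so treat it as a sketch.

```python
# The causal model of Fig. 1 as a mapping from each node to its direct causes.
# A: agent's action; H_p / H_f: causal past / future parts of the history H;
# O: the overseer's percept of the world; E_r: extra reward information;
# R: the reward computed by the overseer.
causal_parents = {
    "H_p": set(),              # the agent's causal past has no modeled causes
    "A":   {"H_p"},            # the action may be influenced by the causal past
    "H_f": {"A"},              # the action causally affects the causal future
    "O":   {"H_p", "H_f"},     # the percept reflects parts of the history
    "E_r": set(),
    "R":   {"A", "O", "E_r"},  # the overseer computes R from A, O and E_r
}

def acyclic(parents):
    """Sanity check that the parent map describes a DAG (simple DFS)."""
    visiting, done = set(), set()
    def dfs(n):
        if n in done:
            return True
        if n in visiting:
            return False
        visiting.add(n)
        ok = all(dfs(p) for p in parents.get(n, ()))
        visiting.discard(n)
        done.add(n)
        return ok
    return all(dfs(n) for n in parents)

print(acyclic(causal_parents))  # True
```

Encoding the graph this way makes the later distinction mechanical: an EDT-style overseer conditions on the value of A, while a CDT-style overseer cuts the incoming edge \(H_p \to A\) before conditioning.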
In principle, the overseer could reward the agent in all kinds of ways. E.g., it could reward the agent “deontologically” (Alexander and Moore 2016) for taking a particular action independently of the consequences of taking that action. In this paper, we will assume that the reward estimates the value of some von Neumann–Morgenstern utility function U that only depends on states of the world. I use the capital U to indicate that the utility function, too, is a random variable (in the Bayesian sense). For simplicity’s sake, we will, again, assume that the set of possible values of U is finite.
We will view U as representing the system designer’s preferences over world states.9 While other ways of assigning the reward are possible, this is certainly an attractive way of getting an approval-directed agent to achieve goals that we want it to achieve. After all, in real-world applications, we will usually care about the outcomes of the agent’s decisions, such as whether a car has reached its destination in time or whether a human has been hurt.
3 The conflict of the decision theories of agent and overseer
When viewed together with the overseer, our agent may now be seen as containing two decision theories, one for computing the reward and one in the algorithm that tries to find the action to maximize that reward. These decision theories may not always be the same. Given this potential discrepancy, the question is which of the two decision theories prevails, i.e., for which configurations of the two decision theories the overall agent acts like a CDT agent and for which it acts like an EDT agent w.r.t. U.
As it turns out, the answer to this question depends on the decision problem in question. In particular, it depends on whether the overseer updates its estimate of U(H) primarily based on the action taken by the agent or on its observation of the environment.
Of course, these are only the two extremes from the set of all possible situations. In real-world Newcomb-like scenarios, the overseer may also draw some information from both sources. Nonetheless, it seems useful to understand the extreme cases, as this may also help us understand mixed ones.
In the following subsections, we will show that in the first type, the decision theory of the agent is decisive, whereas in the second type, the overseer’s decision theory is15. Roughly, the reason for that is the following: As noted earlier, the reward in the first type depends directly on U(H). Thus, the agent will try to maximize U(H) according to its own decision theory. In the second type, the overseer takes the agent’s action a and then considers what either a or do(a) says about U(H). Thus, the agent has to pay careful attention to whether the overseer uses EDT’s or CDT’s expected value.
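In the paper’s notation, the two ways the overseer can evaluate an action contrast roughly as follows; this is a sketch in which the additional conditioning on O and \(E_r\) is suppressed:

```latex
% EDT-style overseer: condition on the action as evidence
R_{\mathrm{EDT}}(a) \;=\; \mathbb{E}\!\left[\, U(H) \mid A = a \,\right]

% CDT-style overseer: intervene on the action via Pearl's do-operator
R_{\mathrm{CDT}}(a) \;=\; \mathbb{E}\!\left[\, U(H) \mid \mathrm{do}(A = a) \,\right]
```

The two expressions differ exactly when conditioning on \(A = a\) carries non-causal (diagnostic) information about H, which is the defining feature of Newcomb-like problems.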
Table 1 An overview of the results of the calculations in Sect. 3, by type of Newcomb problem
3.1 First type
3.1.1 The EDT agent
Because the overseer knows more than the agent, we will need the law of total expectation (LTE) in all of the following derivations. Its application makes it hard to generalize these results to other decision theories, since LTE does not apply if the two decision theories do not both compute a form of expected utility.
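Here “LTE” presumably abbreviates the law of total expectation, which for random variables \(X\) and \(Y\) states:

```latex
\mathbb{E}[X] \;=\; \mathbb{E}\big[\,\mathbb{E}[X \mid Y]\,\big]
```

This is, plausibly, what lets the agent’s expectation of its reward be rewritten in terms of the better-informed overseer’s expectation of \(U(H)\): the overseer’s estimate is the inner conditional expectation, and the agent averages over what it does not yet know.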
3.1.2 The CDT agent
3.2 Second type
3.2.1 The EDT agent
3.2.2 The CDT agent
In this paper, we have taken a step to map reinforcement learning architectures onto decision theories. We found that in Newcomb-like problems, if the overseer rewards the agent purely on the basis of the agent’s action, then the overall system’s behavior is determined by the decision theory implicit in the overseer’s reward function. If the overseer judges the agent based on looking at the world, however, then the agent’s decision theory is decisive.
This has implications for how we should design approval-directed agents. For instance, if we would like to leave decision-theoretical judgements to the overseer, we must ensure that the overseer assigns rewards before making new observations about the world state (cf. Christiano 2014, Sect. “Avoid lock-in”). Of course, this makes the reward less accurate and may thus slow down the agent’s learning process. If we want the overseer to look at both the world and the agent’s action, then we need to align both the overseer’s and the agent’s decision theory.
Much more research is left to be done at the intersection of decision theory and artificial intelligence. For instance, what (if any) decision theories describe the way modern reinforcement learning algorithms maximize reward? Do the results of this paper generalize to sequential decision problems? Moving away from the reinforcement learning framework, what decision theories do other frameworks in AI implement? What about decision theories other than CDT and EDT?
Of course, the existing literature asks about the right decision theory proper. The answer to that question might differ from the answer to the AI-specific question (cf. Kumar 2017; Treutlein 2018). After all, even if we have identified the right decision theory for ourselves, we may want to implement a different decision theory in an AI. One reason could be that the main contenders are not self-recommending: it has been pointed out that EDT and CDT both recommend self-modifying into slightly different decision theories (Meacham 2010; Soares and Fallenstein 2014b, Sect. 3; Yudkowsky 2010, Sect. 2; Greene 2018). The same arguments imply that even someone convinced of CDT or EDT would not want the AI to use CDT or EDT, respectively. That said, one could also leave the self-modification to the AI.
Reinforcement learning and approval-directed agency are also common outside of artificial intelligence. For example, Achen and Bartels (2016, Chap. 4) review evidence which shows that electorates often vote retrospectively to punish or reward incumbents.
This is consistent with what reinforcement learning algorithms usually do—they choose policies rather than individual actions. This is because the utility of a single action usually cannot be evaluated without knowing how the agent will deal with situations that might arise as a result of taking that action.
When individual actions can be evaluated in isolation, the ex ante policy choice sometimes differs from the choice of individual actions (see the absent-minded driver, introduced by Piccione and Rubinstein 1997; cf. Aumann et al. 1997; the Newcomb-like scenarios discussed by, e.g., Hintze 2014; Soares and Levinstein 2017, Sect. 2; and the problems in anthropics discussed by Armstrong 2011). While it is rarely discussed in the debate between evidential and causal decision theorists, a few authors regard this discrepancy as crucial and have argued that a proper decision theory should be about optimal policy choices (e.g. Hintze 2014; Soares and Fallenstein 2014b, Sect. 2.1; Soares and Levinstein 2017, Sect. 2). However, this issue is beyond the scope of the present paper.
Further issues in sequential Newcomb-like problems are discussed by Everitt et al. (2015).
There is a broad philosophical literature on whether causal relationships exist and whether they can be inferred in cases where the agent is part of the environment. See, e.g., the edited volume by Price and Corry (2007).
For simplicity, we will ignore dependences not resulting from causation (Arntzenius 2010). For example, if you play against a copy, there is a logical dependence between your and your copy’s decision. Even if you know a set of nodes in the causal graph that d-separates your and your copy’s decision (e.g., if you know your common source code), the dependence persists. We exclude these dependences because such situations cannot be modeled by standard causal graphs.
However, we could adapt causal graphs to accommodate these kinds of dependences. First, we could modify our definition of causality in such a way that dependence does imply causation, as has been proposed by Spohn (2003, 2012), Yudkowsky (2010) and others. For instance, we could model the dependence between the outputs of two instances of an algorithm by introducing a logical node as a common cause of the two. This logical node would then represent the output of the abstract algorithm that the two copies implement. While changes to the concept of causation may affect CDT’s implied behavior, the results from this paper can be directly transferred to such modifications.
Alternatively, we could extend causal graphs to also include non-causal dependences (cf. Poellinger 2013). Such an extension necessitates a new CDT formalism, so the proofs from this paper do not directly transfer to this case. That said, I expect our results to generalize, given that both EDT and CDT would probably treat non-causal dependences on the action just like they treat causal arrows directed toward the action.
Christiano (2014) does not define approval-directed agency formally, but judging from a comment he made at https://medium.com/paulfchristiano/i-agree-that-the-key-feature-of-approval-directed-agents-is-that-the-causal-picture-is-736b4474910e, he considers it crucial to his conception that the overseer only looks at the agent’s action and does not observe the action’s consequences (cf. the distinction introduced in Sect. 3).
One reason for the overseer to have access to such additional information is that some of the human supervisor’s values may not be expressible in a way that the approval-directed agent’s algorithm can utilize (cf. Muehlhauser and Helm 2012, Sects. 3, 4, 5.3).
At first sight this may be confusing to some readers, because in reinforcement learning, utility sometimes refers to expected cumulative reward (Russell and Norvig 2010, Chaps. 17, 21), although others use the term value function instead (Sutton and Barto 1998, Sect. 3.7). Here, U does not refer to utility in that sense but in the decision-theoretical sense of representing intrinsic values. So, in the present case, we have two “layers” of goals: first, the agent maximizes the reward r; second, the agent, as incentivized by the overseer’s way of calculating rewards, maximizes the utility U(H).
One cause of confusion is that in model applications of reinforcement learning, the reward function possesses full knowledge of the world state and thus does not require the use of the expectation operator.
If the disagreement in Newcomb’s problem is to be about different theories of rational choice (EDT, CDT and so forth) rather than the predictive abilities of “the being”, Omega or the psychologist, then after requesting both boxes a proponent of two-boxing has to believe that she will probably receive only $1000. Causal and evidential decision theorists agree that regular conditional expectation is the correct way of updating one’s beliefs about the state of the world after an action has been taken (cf. the distinction between “acts” and “actions” in Pearl 2009 Sect. 4.1.1).
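The standard payoffs make this concrete. The following toy calculation uses the usual illustrative amounts and assumes a perfect predictor; both the function names and the probability parameter are ours.

```python
# EDT- vs CDT-style expected payoffs in Newcomb's problem with a perfect
# predictor. Opaque box: $1,000,000 iff one-boxing was predicted;
# transparent box: always $1,000.

def edt_value(action):
    # Conditioning on the action: with a perfect predictor, learning the
    # action settles what the opaque box contains.
    opaque = 1_000_000 if action == "one-box" else 0
    transparent = 1_000 if action == "two-box" else 0
    return opaque + transparent

def cdt_value(action, p_one_box_predicted):
    # Intervening on the action: the boxes were filled beforehand, so the
    # opaque box's expected content is fixed and two-boxing always adds
    # the transparent $1,000.
    opaque = 1_000_000 * p_one_box_predicted
    transparent = 1_000 if action == "two-box" else 0
    return opaque + transparent

print(edt_value("two-box"))  # 1000
```

Note that `edt_value("two-box")` is $1000: after requesting both boxes, both camps agree that ordinary conditioning is how to predict what one will actually receive, which is exactly the point of the footnote.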
In reinforcement learning, some have proposed alternative optimization targets that incorporate, e.g., risk aversion (García and Fernández 2015, Sect. 3).
We give a brief justification of this claim. If all of a’s causal influence on H can be discerned from O, then, of course, a could still be diagnostically relevant for one’s estimate of U(H). The other direction is more complicated. The idea is that Eq. 5 can be true if the causal and non-causal implications of a exactly cancel each other out. An example is a version of Newcomb’s problem in which one-boxing ensures with certainty that both boxes contain the same amount of money. Then if O and \(E_r\) do not contain any information, the expected value of two-boxing and one-boxing is the same and so learning of the action is irrelevant for estimating U(H). However, two-boxing is causally better than one-boxing, so Eq. 6 is violated.
The dominance of the overseer’s decision theory in the second type of Newcomb’s problem is mentioned (though not proven) by Christiano (2014, Sect. “Avoid lock-in”).
I am indebted to Max Daniel, Johannes Treutlein, Tom Everitt, Lukas Gloor, Sören Mindermann, Brian Tomasik and Tobias Baumann for valuable comments and discussions.
- Albert, M., & Heiner, R. A. (2001). An indirect-evolution approach to Newcomb’s problem. CSLE discussion paper, no. 2001-01. https://www.econstor.eu/bitstream/10419/23110/1/2001-01_newc.pdf. Accessed 22 Feb 2019.
- Alexander, L., & Moore, M. (2016). Deontological ethics. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. Winter 2016. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2016/entries/ethics-deontological/. Accessed 22 Feb 2019.
- Almond, P. (2010). On causation and correlation part 1: Evidential decision theory is correct. https://casparoesterheld.files.wordpress.com/2016/12/almond_edt_1.pdf. Accessed 22 Feb 2019.
- Armstrong, S. (2011). Anthropic decision theory. Future of Humanity Institute. arXiv: 1110.6437.
- Arntzenius, F. (2010). Reichenbach’s common cause principle. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. Fall 2010. Metaphysics Research Lab, Stanford University.
- Billingsley, P. (1995). Probability and measure (3rd ed.). Hoboken: Wiley.
- Bostrom, N. (2014a). Hail mary, value porosity, and utility diversification. http://www.nickbostrom.com/papers/porosity.pdf. Accessed 22 Feb 2019.
- Bostrom, N. (2014b). Superintelligence. Paths, dangers, strategies (1st ed.). Oxford: Oxford University Press.
- Briggs, R. (2017). Real-life Newcomb problems? In Talk at the 1st workshop on decision theory & the future of artificial intelligence in Cambridge, UK.
- Christiano, P. (2014). Model-free decisions. https://ai-alignment.com/model-free-decisions-6e6609f5d99e. Accessed 22 Feb 2019.
- Christiano, P. (2016). Adequate oversight. https://ai-alignment.com/adequate-oversight-25fadf1edce9. Accessed 22 Feb 2019.
- Everitt, T., Leike, J., & Hutter, M. (2015). Sequential extensions of causal and evidential decision theory. In T. Walsh (Ed.), Algorithmic decision theory: 4th international conference, ADT 2015, Lexington, KY, USA, September 27–30, 2015, Proceedings (pp. 205–221). Springer. https://doi.org/10.1007/978-3-319-23114-3_13.
- Fisher, J. C. Disposition-based decision theory. https://casparoesterheld.files.wordpress.com/2019/02/dbdt.pdf. Accessed 22 Feb 2019.
- García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16, 1437–1480.
- Gibbard, A., & Harper, W. L. (1981). Counterfactuals and two kinds of expected utility. In W. L. Harper, R. Stalnaker, & G. Pearce (Eds.), Ifs. Conditionals, belief, decision, chance and time (Vol. 15). The University of Western Ontario Series in Philosophy of Science. A series of books in philosophy of science, methodology, epistemology, logic, history of science, and related fields (pp. 153–190). Springer. https://doi.org/10.1007/978-94-009-9117-0_8.
- Greene, P. (2018). Success-first decision theories. In A. Ahmed (Ed.), Newcomb’s problem. Classic Philosophical Arguments. Cambridge University Press. https://doi.org/10.1017/9781316847893.007.
- Hintze, D. (2014). Problem class dominance in predictive dilemmas. http://intelligence.org/files/ProblemClassDominance.pdf. Accessed 22 Feb 2019.
- Hutter, M. (2005). Universal artificial intelligence: Sequential decisions based on algorithmic probability. In W. Brauer, G. Rozenberg, & A. Salomaa (Eds.), Texts in theoretical computer science. Springer.
- Kuhn, S. (2017). Prisoner’s dilemma. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. Spring 2017. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/spr2017/entries/prisoner-dilemma/. Accessed 22 Feb 2019.
- Kumar, R. (2017). New work for decision theorists. In Talk at the 1st workshop on decision theory & the future of artificial intelligence in Cambridge, UK.
- Ledwig, M. (2000). Newcomb’s problem. Ph.D. thesis, University of Constance. https://kops.uni-konstanz.de/bitstream/handle/123456789/3451/ledwig.pdf. Accessed 22 Feb 2019.
- Mayer, D., Feldmaier, J., & Shen, H. (2016). Reinforcement learning in conflicting environments for autonomous vehicles. In International workshop on robotics in the 21st century: Challenges and promises. arXiv: 1610.07089.
- Muehlhauser, L., & Helm, L. (2012). Intelligence explosion and machine ethics. Machine Intelligence Research Institute. https://intelligence.org/files/IE-ME.pdf. Accessed 22 Feb 2019.
- Oesterheld, C. (2018a). Doing what has worked well in the past leads to evidential decision theory. https://casparoesterheld.files.wordpress.com/2018/01/learning-dt.pdf. Accessed 22 Feb 2019.
- Oesterheld, C. (2018b). Newcomb’s problem, the Prisoner’s dilemma and large universes: A consideration for consequentialists. In Talk at the 15th conference of the international society for utilitarian studies. Karlsruhe Institute of Technology (KIT), July 24–26, 2018.
- Poellinger, R. (2013). Unboxing the concepts in Newcomb’s paradox: Causation, prediction, decision. http://philsci-archive.pitt.edu/9887/7/newcomb_in_ckps.pdf. Accessed 22 Feb 2019.
- Price, H., & Corry, R. (Eds.). (2007). Causation, physics, and the constitution of reality: Russell’s republic revisited. Oxford: Oxford University Press.
- Ross, S. M. (2007). Introduction to probability models (9th ed.). Cambridge: Academic Press.
- Russell, S., & Norvig, P. (2010). Artificial intelligence. A modern approach (3rd ed.). London: Pearson Education, Inc.
- Soares, N. (2014a). Newcomblike problems are the norm. http://mindingourway.com/newcomblike-problems-are-the-norm/. Accessed 22 Feb 2019.
- Soares, N. (2014b). Why Ain’t you rich? https://intelligence.org/2014/10/07/nate-soares-talk-aint-rich/. Accessed 22 Feb 2019.
- Soares, N, & Fallenstein, B. (2014a). Aligning superintelligence with human interests: A technical research agenda. Technical report. 2014-8. Machine Intelligence Research Institute. https://intelligence.org/files/TechnicalAgenda.pdf. Accessed 22 Feb 2019.
- Soares, N, & Fallenstein, B. (2014b). Toward idealized decision theory. Technical report 2014-7. Machine Intelligence Research Institute. arXiv: 1507.01986.
- Soares, N., & Levinstein, B. A. (2017). Cheating death in damascus. In Formal epistemology workshop (FEW) 2017. University of Washington, Seattle, USA. https://intelligence.org/files/DeathInDamascus.pdf. Accessed 22 Feb 2019.
- Sorg, J. D. (2011). The optimal reward problem: Designing effective reward for bounded agents. PhD thesis, University of Michigan. https://deepblue.lib.umich.edu/bitstream/handle/2027.42/89705/jdsorg_1.pdf. Accessed 22 Feb 2019.
- Spohn, W. (2003). Dependency equilibria and the causal structure of decision and game situations. Homo Oeconomicus, 20, 195–255.
- Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
- Treutlein, J. (2018). How the decision theory of Newcomb-like problems differs between humans and machines. In Talk at the 2nd workshop on decision theory & the future of artificial intelligence in Munich, Germany.
- Treutlein, J., & Oesterheld, C. (2017). A wager for evidential decision theory. Unpublished manuscript.
- Weirich, P. (2016). Causal decision theory. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. Spring 2016. Metaphysics Research Lab, Stanford University.
- Yudkowsky, E. (2010). Timeless decision theory. The Singularity Institute. http://intelligence.org/files/TDT.pdf. Accessed 22 Feb 2019.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.