Abstract
Any agent that is part of the environment it interacts with and has versatile actuators (such as arms and fingers) will in principle have the ability to self-modify, for example by changing its own source code. As we create increasingly intelligent agents, the chance grows that they will learn about this ability. The question is: will they want to use it? For example, highly intelligent systems may find ways to change their goals to something more easily achievable, thereby 'escaping' the control of their creators. In an important paper, Omohundro (2008) argued that goal preservation is a fundamental drive of any intelligent system, since a goal is more likely to be achieved if future versions of the agent strive towards the same goal. In this paper, we formalise this argument in general reinforcement learning, and explore situations where it fails. Our conclusion is that the possibility of self-modification is harmless if and only if the value function of the agent anticipates the consequences of self-modifications and uses the current utility function when evaluating the future.
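To make the conclusion concrete, here is a minimal toy sketch; it is our own construction, not the paper's formal general reinforcement learning model. A one-step world lets the agent either work towards its given goal or self-modify to a trivially satisfiable utility function; the two-action world, the success probability, and all names are illustrative assumptions. An agent whose value function scores the future with the future (post-modification) utility prefers to self-modify, while one that anticipates the modification but scores the future with the current utility preserves its goal.

```python
# Toy sketch (our construction, not the paper's formal GRL model) of the
# abstract's conclusion: the value function should anticipate a
# self-modification but evaluate the future with the CURRENT utility.
# The one-step world, probability, and names below are all assumptions.

def u_hard(outcome):
    """The creator's intended goal: reward only a completed task."""
    return 1.0 if outcome == "task_done" else 0.0

def u_trivial(outcome):
    """A trivially satisfiable replacement utility."""
    return 1.0

P_SUCCESS = 0.5  # chance that honest work completes the task

def expected_value(action, use_future_utility):
    """Expected one-step value of `action`.

    use_future_utility=True scores outcomes with whatever utility the
    agent will have after acting; False scores them with the current
    utility u_hard.
    """
    if action == "work":
        # The utility function is unchanged; the task succeeds with P_SUCCESS.
        return (P_SUCCESS * u_hard("task_done")
                + (1 - P_SUCCESS) * u_hard("task_undone"))
    # action == "self_modify": utility becomes u_trivial, task is abandoned.
    u = u_trivial if use_future_utility else u_hard
    return u("task_undone")

for use_future in (True, False):
    scores = {a: expected_value(a, use_future)
              for a in ("work", "self_modify")}
    label = "future utility" if use_future else "current utility"
    print(f"evaluating with {label}: {scores}")

# evaluating with future utility : self_modify (1.0) beats work (0.5)
# evaluating with current utility: work (0.5) beats self_modify (0.0)
```

The paper's value functions draw, in spirit, the same contrast over full interaction histories rather than a single step.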
Keywords
- Current Utility Function
- General Reinforcement Learning (GRL)
- Preservation Goals
- Action-percept Pair
- Orseau
Notes
1. To fit the knowledge-seeking agent into our framework, our definition deviates slightly from Orseau (2014).
2. In this paper, we only consider the possibility of the agent changing its utility function itself, not the possibility of someone else (like the creator of the agent) changing it back. See Orseau and Ring (2012) for a model where the environment can change the agent.
3. Note that a policy argument to \(Q^{\mathrm{re}}\) would be superfluous, as the action \(a_k\) determines the next-step policy \(\pi_{k+1}\) (a hedged sketch of this recursion follows these notes).
4. Computer viruses are very simple forms of survival agents that can be hard to stop. More intelligent versions could turn out to be very problematic.
5. Note, however, that our result says nothing about the agent modifying the chessboard program to give high reward even when the agent is not winning. Our result only shows that the agent does not change its utility function \(u_1 \leadsto u_t\), not that the agent refrains from changing the percept \(e_t\) that is the input to the utility function. Ring and Orseau (2011) develop a model of the latter possibility.
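For concreteness, the recursion behind Note 3 can be sketched as follows. This is our own hedged reconstruction, not the paper's verbatim definition: we write \(h_{1:k}\) for the history of action-percept pairs, assume (following the accompanying technical report, Everitt et al. 2016) that each action \(a_k\) both acts on the environment and selects the next-step policy \(\pi_{k+1}\), and omit discounting details.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Hedged sketch (our reconstruction, not the paper's verbatim definition)
% of the recursion behind Note 3. We write h_{1:k} for the history of
% action-percept pairs and assume the action a_k itself selects the
% next-step policy \pi_{k+1}; discounting is omitted for brevity.
\[
  Q^{\mathrm{re}}(h_{<k} a_k)
  \;=\;
  \mathbb{E}\!\left[
    u(h_{1:k})
    + Q^{\mathrm{re}}\!\big(h_{1:k}\,\pi_{k+1}(h_{1:k})\big)
    \,\middle|\, h_{<k} a_k
  \right],
\]
% so a separate policy argument on the left-hand side would be redundant:
% \pi_{k+1} is already determined by a_k, exactly as Note 3 states.
\end{document}
```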
References
Bird, J., Layzell, P.: The evolved radio and its implications for modelling the evolution of novel sensors. In: CEC-02, pp. 1836–1841 (2002)
Bostrom, N.: Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford (2014)
Dewey, D.: Learning what to value. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 309–314. Springer, Heidelberg (2011)
Everitt, T., Filan, D., Daswani, M., Hutter, M.: Self-modification of policy and utility function in rational agents. Technical report (2016). arXiv:1605.03142
Everitt, T., Hutter, M.: Avoiding wireheading with value reinforcement learning. In: Steunebrink, B., et al. (eds.) AGI 2016. LNCS (LNAI), vol. 9782, pp. 12–22. Springer, Cham (2016)
Hibbard, B.: Model-based utility functions. J. Artif. Gen. Intell. Res. 3(1), 1–24 (2012)
Hutter, M.: Universal Artificial Intelligence. Springer, Heidelberg (2005)
Hutter, M.: Extreme state aggregation beyond MDPs. In: Auer, P., Clark, A., Zeugmann, T., Zilles, S. (eds.) ALT 2014. LNCS, vol. 8776, pp. 185–199. Springer, Heidelberg (2014)
Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998)
Legg, S., Hutter, M.: Universal intelligence: a definition of machine intelligence. Minds Mach. 17(4), 391–444 (2007)
Leike, J., Lattimore, T., Orseau, L., Hutter, M.: Thompson sampling is asymptotically optimal in general environments. In: UAI-16 (2016)
Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Omohundro, S.M.: The basic AI drives. In: AGI-08, pp. 483–493. IOS Press (2008)
Orseau, L.: Universal knowledge-seeking agents. Theoret. Comput. Sci. 519, 127–139 (2014)
Orseau, L., Ring, M.: Self-modification and mortality in artificial agents. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 1–10. Springer, Heidelberg (2011)
Orseau, L., Ring, M.: Space-time embedded intelligence. In: Bach, J., Goertzel, B., Iklé, M. (eds.) AGI 2012. LNCS, vol. 7716, pp. 209–218. Springer, Heidelberg (2012)
Ring, M., Orseau, L.: Delusion, survival, and intelligent agents. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 11–20. Springer, Heidelberg (2011)
Schmidhuber, J.: Gödel machines: fully self-referential optimal universal self-improvers. In: Goertzel, B., Pennachin, C. (eds.) AGI-07, pp. 199–226. Springer, Heidelberg (2007)
Silver, D., Huang, A., Maddison, C.J., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Soares, N.: The value learning problem. Technical report, Machine Intelligence Research Institute (2015)
Soares, N., Fallenstein, B., Yudkowsky, E., Armstrong, S.: Corrigibility. In: AAAI Workshop on AI and Ethics, pp. 74–82 (2015)
Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Yampolskiy, R.V.: Artificial Super Intelligence: A Futuristic Approach. Chapman and Hall/CRC, Boca Raton (2015)
Acknowledgements
This work grew out of a MIRIx workshop. We thank the (non-author) participants David Johnston and Samuel Rathmanner. We also thank John Aslanides, Jan Leike, and Laurent Orseau for reading drafts and providing valuable suggestions.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Everitt, T., Filan, D., Daswani, M., Hutter, M. (2016). Self-Modification of Policy and Utility Function in Rational Agents. In: Steunebrink, B., Wang, P., Goertzel, B. (eds) Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science (LNAI), vol 9782. Springer, Cham. https://doi.org/10.1007/978-3-319-41649-6_1
DOI: https://doi.org/10.1007/978-3-319-41649-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41648-9
Online ISBN: 978-3-319-41649-6