
Avoiding Wireheading with Value Reinforcement Learning

  • Conference paper in: Artificial General Intelligence (AGI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9782)


Abstract

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) may seem like a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward – the so-called wireheading problem. In this paper we suggest an alternative to RL called value reinforcement learning (VRL). In VRL, agents use the reward signal to learn a utility function. The VRL setup allows us to remove the incentive to wirehead by placing a constraint on the agent’s actions. The constraint is defined in terms of the agent’s belief distributions, and does not require an explicit specification of which actions constitute wireheading.
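To make the VRL idea concrete, the Python sketch below shows a toy agent that treats each observed reward as noisy evidence about which candidate utility function is the true one, and then chooses the action with the highest expected utility under that belief rather than the highest expected reward. This is a hypothetical illustration under invented names (update_belief, choose_action, candidate_utilities, B), not the paper's formal construction, and it omits the consistency-preserving (CP) constraint that removes the wireheading incentive.

```python
import numpy as np

# Toy sketch (not the paper's formalism): the reward signal is treated as noisy
# evidence about an unknown "true" utility function, and actions are chosen to
# maximise expected utility under the agent's current belief over candidates.

def update_belief(prior, reward, state, candidate_utilities, noise=0.1):
    """Bayesian update of the belief over candidate utility functions,
    treating the observed reward as a noisy reading of u(state)."""
    likelihood = np.array([
        np.exp(-(reward - u(state)) ** 2 / (2 * noise ** 2))
        for u in candidate_utilities
    ])
    posterior = prior * likelihood
    return posterior / posterior.sum()

def choose_action(belief, candidate_utilities, actions, B):
    """Pick the action with the highest expected utility, where B(a) is the
    agent's belief over outcome states (here a dict mapping state -> probability)."""
    def expected_utility(a):
        return sum(
            p * sum(w * u(s) for w, u in zip(belief, candidate_utilities))
            for s, p in B(a).items()
        )
    return max(actions, key=expected_utility)
```

In a loop, such an agent would call choose_action, observe a reward, and call update_belief, starting from, say, a uniform prior np.ones(n) / n over n candidates; the point of the sketch is that the belief over utilities, not the raw reward, drives action selection.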


Notes

  1. The difference between RL and utility agents is mirrored in the experience machine debate (Sinnott-Armstrong 2015, Sect. 3) initiated by Nozick (1974). Given the option to enter a machine that will offer you the most pleasant delusions, but make you useless to the ‘real world’, would you enter? An RL agent would enter, but a utility agent would not.

  2. The wireheading problem addressed in this paper arises from agents subverting evidence or reward. A companion paper (Everitt et al. 2016) shows how to avoid the related problem of agents modifying themselves.

  3. For the sequential case, we would have transition probabilities of the form \(B(s'\mid s,a)\) instead of \(B(s'\mid a)\), with \(s\) the current state and \(s'\) the next state (see the sketch after these notes).

  4. The wireheading problem that the replacement gives rise to is explained in Sect. 4, and overcome by Definition 5 and Theorem 14 below.

  5. Everitt and Hutter (2016, Appendix B) discuss how to design agents with consistent belief distributions.

  6. In this analogy, a self-deluding action would be to decide to look inside a fridge while at the same time putting a picture of milk in front of my eyes.

  7. Technically, it is possible that the agent self-deludes by a CP action. However, the agent has no incentive to do so, and inadvertent self-delusion is typically implausible.
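To make note 3 concrete, here is a small hypothetical illustration (states and probabilities invented, reusing the fridge example from note 6) of the one-shot belief \(B(s'\mid a)\) versus the sequential belief \(B(s'\mid s,a)\):

```python
from typing import Dict, Tuple

State, Action = str, str

# One-shot belief B(s' | a): the outcome distribution depends only on the action.
B_one_shot: Dict[Action, Dict[State, float]] = {
    "look_in_fridge": {"see_milk": 0.7, "see_no_milk": 0.3},
}

# Sequential belief B(s' | s, a): the outcome also depends on the current state s.
B_sequential: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("at_fridge", "look_in_fridge"): {"see_milk": 0.7, "see_no_milk": 0.3},
    ("at_desk", "look_in_fridge"): {"see_milk": 0.0, "see_no_milk": 1.0},
}
```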

References

  • Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: ICML, pp. 1–8 (2004)

  • Amin, K., Singh, S.: Towards resolving unidentifiability in inverse reinforcement learning (2016). http://arXiv.org/abs/1601.06569

  • Armstrong, S.: Motivated value selection for artificial agents. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 12–20 (2015)

  • Bostrom, N.: Hail Mary, value porosity, and utility diversification. Technical report, Oxford University (2014a)

  • Bostrom, N.: Superintelligence: Paths, Dangers, Strategies. Oxford University Press, New York (2014b)

  • Dewey, D.: Learning what to value. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 309–314. Springer, Heidelberg (2011)

  • Evans, O., Stuhlmüller, A., Goodman, N.D.: Learning the preferences of ignorant, inconsistent agents. In: AAAI 2016 (2016)

  • Everitt, T., Filan, D., Daswani, M., Hutter, M.: Self-modification of policy and utility function in rational agents. In: Steunebrink, B., et al. (eds.) AGI 2016. LNAI, vol. 9782, pp. 1–11. Springer, Heidelberg (2016). http://arXiv.org/abs/1605.03142

  • Everitt, T., Hutter, M.: Avoiding wireheading with value reinforcement learning (2016). http://arXiv.org/abs/1605.03143

  • Hibbard, B.: Model-based utility functions. J. Artif. General Intell. 3(1), 1–24 (2012)

  • Kurzweil, R.: The Singularity Is Near. Viking Press, New York (2005)

  • Ng, A., Russell, S.: Algorithms for inverse reinforcement learning. In: ICML, pp. 663–670 (2000)

  • Nozick, R.: Anarchy, State, and Utopia. Basic Books, New York (1974)

  • Omohundro, S.M.: The basic AI drives. In: AGI-08, vol. 171, pp. 483–493. IOS Press (2008)

  • Ring, M., Orseau, L.: Delusion, survival, and intelligent agents. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 11–20. Springer, Heidelberg (2011)

  • Sezener, C.E.: Inferring human values for safe AGI design. In: Bieger, J., Goertzel, B., Potapov, A. (eds.) AGI 2015. LNCS, vol. 9205, pp. 152–155. Springer, Heidelberg (2015)

  • Sinnott-Armstrong, W.: Consequentialism. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Winter 2015 edn. (2015)

  • Soares, N.: The value learning problem. Technical report, MIRI (2015)

  • Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)


Acknowledgements

We thank Jan Leike and Jarryd Martin for proofreading and valuable suggestions.

Author information


Corresponding author

Correspondence to Tom Everitt.



Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Everitt, T., Hutter, M. (2016). Avoiding Wireheading with Value Reinforcement Learning. In: Steunebrink, B., Wang, P., Goertzel, B. (eds.) Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science (LNAI), vol. 9782. Springer, Cham. https://doi.org/10.1007/978-3-319-41649-6_2


  • DOI: https://doi.org/10.1007/978-3-319-41649-6_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41648-9

  • Online ISBN: 978-3-319-41649-6

