
Avoiding Wireheading with Value Reinforcement Learning

  • Conference paper in: Artificial General Intelligence (AGI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9782)


Abstract

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) may seem like a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward – the so-called wireheading problem. In this paper we suggest an alternative to RL called value reinforcement learning (VRL). In VRL, agents use the reward signal to learn a utility function. The VRL setup allows us to remove the incentive to wirehead by placing a constraint on the agent’s actions. The constraint is defined in terms of the agent’s belief distributions, and does not require an explicit specification of which actions constitute wireheading.
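To make the VRL idea concrete, the Python sketch below shows a toy agent that treats each observed reward as noisy evidence about which candidate utility function is the true one, and then chooses the action with the highest expected utility under that belief rather than the highest expected reward. This is a hypothetical illustration under invented names (update_belief, choose_action, candidate_utilities, B), not the paper's formal construction, and it omits the consistency-preserving (CP) constraint that removes the wireheading incentive.

```python
import numpy as np

# Toy sketch (not the paper's formalism): the reward signal is treated as noisy
# evidence about an unknown "true" utility function, and actions are chosen to
# maximise expected utility under the agent's current belief over candidates.

def update_belief(prior, reward, state, candidate_utilities, noise=0.1):
    """Bayesian update of the belief over candidate utility functions,
    treating the observed reward as a noisy reading of u(state)."""
    likelihood = np.array([
        np.exp(-(reward - u(state)) ** 2 / (2 * noise ** 2))
        for u in candidate_utilities
    ])
    posterior = prior * likelihood
    return posterior / posterior.sum()

def choose_action(belief, candidate_utilities, actions, B):
    """Pick the action with the highest expected utility, where B(a) is the
    agent's belief over outcome states (here a dict mapping state -> probability)."""
    def expected_utility(a):
        return sum(
            p * sum(w * u(s) for w, u in zip(belief, candidate_utilities))
            for s, p in B(a).items()
        )
    return max(actions, key=expected_utility)
```

In a loop, such an agent would call choose_action, observe a reward, and call update_belief, starting from, say, a uniform prior np.ones(n) / n over n candidates; the point of the sketch is that the belief over utilities, not the raw reward, drives action selection.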


Notes

  1. The difference between RL and utility agents is mirrored in the experience machine debate (Sinnott-Armstrong 2015, Sect. 3) initiated by Nozick (1974). Given the option to enter a machine that will offer you the most pleasant delusions, but make you useless to the ‘real world’, would you enter? An RL agent would enter, but a utility agent would not.

  2. The wireheading problem addressed in this paper arises from agents subverting evidence or reward. A companion paper (Everitt et al. 2016) shows how to avoid the related problem of agents modifying themselves.

  3. For the sequential case, we would have transition probabilities of the form \(B(s'\mid s,a)\) instead of \(B(s'\mid a)\), with \(s\) the current state and \(s'\) the next state (see the sketch after these notes).

  4. The wireheading problem that the replacement gives rise to is explained in Sect. 4, and overcome by Definition 5 and Theorem 14 below.

  5. Everitt and Hutter (2016, Appendix B) discuss how to design agents with consistent belief distributions.

  6. In this analogy, a self-deluding action would be to decide to look inside a fridge while at the same time putting a picture of milk in front of my eyes.

  7. Technically, it is possible that the agent self-deludes by a CP action. However, the agent has no incentive to do so, and inadvertent self-delusion is typically implausible.
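To make note 3 concrete, here is a small hypothetical illustration (states and probabilities invented, reusing the fridge example from note 6) of the one-shot belief \(B(s'\mid a)\) versus the sequential belief \(B(s'\mid s,a)\):

```python
from typing import Dict, Tuple

State, Action = str, str

# One-shot belief B(s' | a): the outcome distribution depends only on the action.
B_one_shot: Dict[Action, Dict[State, float]] = {
    "look_in_fridge": {"see_milk": 0.7, "see_no_milk": 0.3},
}

# Sequential belief B(s' | s, a): the outcome also depends on the current state s.
B_sequential: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("at_fridge", "look_in_fridge"): {"see_milk": 0.7, "see_no_milk": 0.3},
    ("at_desk", "look_in_fridge"): {"see_milk": 0.0, "see_no_milk": 1.0},
}
```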

References

  • Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: ICML, pp. 1–8 (2004)

  • Amin, K., Singh, S.: Towards resolving unidentifiability in inverse reinforcement learning (2016). http://arXiv.org/abs/1601.06569

  • Armstrong, S.: Motivated value selection for artificial agents. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 12–20 (2015)

  • Bostrom, N.: Hail Mary, value porosity, and utility diversification. Technical report, Oxford University (2014a)

  • Bostrom, N.: Superintelligence: Paths, Dangers, Strategies. Oxford University Press, New York (2014b)

  • Dewey, D.: Learning what to value. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 309–314. Springer, Heidelberg (2011)

  • Evans, O., Stuhlmüller, A., Goodman, N.D.: Learning the preferences of ignorant, inconsistent agents. In: AAAI 2016 (2016)

  • Everitt, T., Filan, D., Daswani, M., Hutter, M.: Self-modification of policy and utility function in rational agents. In: Steunebrink, B., et al. (eds.) AGI 2016. LNAI, vol. 9782, pp. 1–11. Springer, Heidelberg (2016). http://arXiv.org/abs/1605.03142

  • Everitt, T., Hutter, M.: Avoiding wireheading with value reinforcement learning (2016). http://arXiv.org/abs/1605.03143

  • Hibbard, B.: Model-based utility functions. J. Artif. General Intell. 3(1), 1–24 (2012)

  • Kurzweil, R.: The Singularity Is Near. Viking Press, New York (2005)

  • Ng, A., Russell, S.: Algorithms for inverse reinforcement learning. In: ICML, pp. 663–670 (2000)

  • Nozick, R.: Anarchy, State, and Utopia. Basic Books, New York (1974)

  • Omohundro, S.M.: The basic AI drives. In: AGI-08, vol. 171, pp. 483–493. IOS Press (2008)

  • Ring, M., Orseau, L.: Delusion, survival, and intelligent agents. In: Schmidhuber, J., Thórisson, K.R., Looks, M. (eds.) AGI 2011. LNCS, vol. 6830, pp. 11–20. Springer, Heidelberg (2011)

  • Sezener, C.E.: Inferring human values for safe AGI design. In: Bieger, J., Goertzel, B., Potapov, A. (eds.) AGI 2015. LNCS, vol. 9205, pp. 152–155. Springer, Heidelberg (2015)

  • Sinnott-Armstrong, W.: Consequentialism. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Winter 2015 edn. (2015)

  • Soares, N.: The value learning problem. Technical report, MIRI (2015)

  • Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)


Acknowledgements

We thank Jan Leike and Jarryd Martin for proofreading and valuable suggestions.

Author information


Corresponding author

Correspondence to Tom Everitt.



Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Everitt, T., Hutter, M. (2016). Avoiding Wireheading with Value Reinforcement Learning. In: Steunebrink, B., Wang, P., Goertzel, B. (eds.) Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science (LNAI), vol. 9782. Springer, Cham. https://doi.org/10.1007/978-3-319-41649-6_2


  • DOI: https://doi.org/10.1007/978-3-319-41649-6_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41648-9

  • Online ISBN: 978-3-319-41649-6

