Avoiding Wireheading with Value Reinforcement Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9782)


How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) may seem like a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward – the so-called wireheading problem. In this paper we suggest an alternative to RL called value reinforcement learning (VRL). In VRL, agents use the reward signal to learn a utility function. The VRL setup allows us to remove the incentive to wirehead by placing a constraint on the agent’s actions. The constraint is defined in terms of the agent’s belief distributions, and does not require an explicit specification of which actions constitute wireheading.
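The wireheading incentive described above can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the two action names, the numeric values, and the `is_consistent` check are hypothetical stand-ins, not the formal VRL definitions from the paper. The sketch only shows the general idea that an agent maximising the sensor reading prefers to tamper, while an agent that maximises a learned utility and is restricted to actions where predicted reward and believed utility agree does not.

```python
# Toy illustration only: action names, numbers, and the consistency check
# are assumptions for this sketch, not definitions taken from the paper.

ACTIONS = ["work", "tamper_with_sensor"]

# Hypothetical reward the sensor would report after each action.
# Tampering forces the sensor to its maximum reading.
PREDICTED_REWARD = {"work": 0.6, "tamper_with_sensor": 1.0}

# The agent's belief over candidate utility functions, given as
# (probability, utility table) pairs learned from earlier rewards.
UTILITY_BELIEF = [
    (0.7, {"work": 0.6, "tamper_with_sensor": 0.0}),
    (0.3, {"work": 0.6, "tamper_with_sensor": 0.1}),
]


def expected_utility(action):
    """Average utility of an action under the agent's utility belief."""
    return sum(prob * table[action] for prob, table in UTILITY_BELIEF)


def rl_choice():
    """Plain reward maximiser: picks whatever makes the sensor read highest."""
    return max(ACTIONS, key=lambda a: PREDICTED_REWARD[a])


def is_consistent(action, tol=1e-9):
    """Hypothetical constraint: predicted reward must match believed utility."""
    return abs(PREDICTED_REWARD[action] - expected_utility(action)) <= tol


def constrained_choice():
    """Utility maximiser restricted to actions that pass the consistency check."""
    allowed = [a for a in ACTIONS if is_consistent(a)]
    return max(allowed, key=expected_utility)


if __name__ == "__main__":
    print("reward maximiser picks:", rl_choice())
    print("constrained utility maximiser picks:", constrained_choice())
```

Note that the check never names tampering explicitly. In this toy setting the tampering action is excluded only because the reward it would produce disagrees with what the agent believes the action is worth, which mirrors the abstract's point that the constraint needs no explicit list of wireheading actions.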



We thank Jan Leike and Jarryd Martin for proofreading and for giving valuable suggestions.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Australian National University, Canberra, Australia
