A Game-Theoretic Analysis of the Off-Switch Game

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10414)


The off-switch game is a game-theoretic model of a highly intelligent robot interacting with a human. In the original paper by Hadfield-Menell et al. (2016b), the analysis is not fully game-theoretic, as the human is modelled as an irrational player and the robot’s best action is calculated only under unrealistic normality and soft-max assumptions. In this paper, we make the analysis fully game-theoretic by modelling the human as a rational player with a random utility function. As a consequence, we can easily calculate the robot’s best action for arbitrary belief and irrationality assumptions.
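As a rough illustration of the idea, and not the formalism of the paper itself, the following sketch compares three robot options under a sampled belief about the human's utility for the robot's action. The function name `best_robot_action`, the option labels, and the Gaussian perception noise are all assumptions made for this example: the human is treated as a rational player who permits the action exactly when their own (randomly perturbed) utility for it is positive.

```python
import random


def best_robot_action(belief_samples, noise_sd=0.0, seed=0):
    """Compare the robot's options under a sampled belief (illustrative only).

    belief_samples: draws from the robot's belief about the utility u of its action.
    noise_sd: standard deviation of hypothetical Gaussian noise added to the
        human's own utility; 0.0 means the human's utility equals u exactly.
    """
    rng = random.Random(seed)
    n = len(belief_samples)

    # Option 1: act immediately and receive u.
    act = sum(belief_samples) / n

    # Option 2: switch off and receive 0.
    off = 0.0

    # Option 3: wait and let the human decide. The human, maximising their own
    # random utility u + noise, permits the action only when it is positive.
    # The robot receives u if permitted and 0 otherwise.
    wait = sum(
        u for u in belief_samples if u + rng.gauss(0.0, noise_sd) > 0
    ) / n

    values = {"act": act, "off": off, "wait": wait}
    # Ties are broken in favour of the option listed first.
    return max(values, key=values.get), values


if __name__ == "__main__":
    choice, values = best_robot_action([-2.0, 1.0, 3.0])
    print(choice, values)
```

Because the belief is passed in as plain samples, the same function applies to any belief distribution, and increasing `noise_sd` makes the human's decision less aligned with the robot's utility, which in this toy model lowers the value of waiting.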



This work grew out of a MIRIx workshop also attended by Owen Cameron, John Aslanides, and Huon Puertas. Thanks to Amy Zhang for proofreading multiple drafts. This work was supported in part by ARC grant DP150104590.


  1. Allais, M.: Le comportement de l’homme rationnel devant le risque: critique des postulats et axiomes de l’école Américaine. Econometrica 21(4), 503–546 (1953). doi: 10.2307/1907921
  2. Armstrong, S.: Motivated value selection for artificial agents. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 12–20 (2015)
  3. Armstrong, S.: Utility indifference. Technical report. Oxford University, pp. 1–5 (2010)
  4. Armstrong, S., Leike, J.: Towards interactive inverse reinforcement learning. In: NIPS Workshop (2016)
  5. Dewey, D.: Learning what to value. In: Artificial General Intelligence, vol. 6830, pp. 309–314 (2011). ISBN 978-3-642-22886-5. doi: 10.1007/978-3-642-22887-2. arXiv: 1402.5379
  6. Everitt, T., Filan, D., Daswani, M., Hutter, M.: Self-modification of policy and utility function in rational agents. In: Steunebrink, B., Wang, P., Goertzel, B. (eds.) AGI 2016. LNCS, vol. 9782, pp. 1–11. Springer, Cham (2016). doi: 10.1007/978-3-319-41649-6_1
  7. Hadfield-Menell, D., et al.: Cooperative inverse reinforcement learning (2016a). arXiv: 1606.03137
  8. Hadfield-Menell, D., et al.: The off-switch game 2008, pp. 1–11 (2016b). arXiv: 1611.08219
  9. Martin, J., Everitt, T., Hutter, M.: Death and suicide in universal artificial intelligence. In: Steunebrink, B., Wang, P., Goertzel, B. (eds.) AGI 2016. LNCS, vol. 9782, pp. 23–32. Springer, Cham (2016). doi: 10.1007/978-3-319-41649-6_3. arXiv: 1606.00652
  10. Omohundro, S.M.: The basic AI drives. In: Wang, P., Goertzel, B., Franklin, S. (eds.) Artificial General Intelligence, vol. 171, pp. 483–493. IOS Press (2008)
  11. Orseau, L., Armstrong, S.: Safely interruptible agents. In: 32nd Conference on Uncertainty in Artificial Intelligence (2016)
  12. Rasmusen, E.: Games and Information, 2nd edn. Blackwell, Oxford (1994)
  13. Soares, N., Fallenstein, B.: A technical research agenda. Technical report. Machine Intelligence Research Institute (MIRI), pp. 1–14
  14. Soares, N., et al.: Corrigibility. In: AAAI Workshop on AI and Ethics, pp. 74–82 (2015)
  15. Von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior. Princeton Classic Editions. Princeton University Press, Princeton (1947). ISBN 0691003629. doi: 10.1177/1468795X06065810. Lambert, S., Deuber, O. (eds.)
  16. Wiener, N.: Some moral and technical consequences of automation. Science 131(3410), 1355–1358 (1960). ISSN 0036–8075. doi: 10.1126/science.132.3429.741

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Australian National University, Acton, Australia
  2. Linköping University, Linköping, Sweden
