
Towards AGI Agent Safety by Iteratively Improving the Utility Function

Part of the Lecture Notes in Computer Science book series (LNAI,volume 12177)

Abstract

While it is still unclear whether agents with Artificial General Intelligence (AGI) could ever be built, we can already use mathematical models to investigate potential safety systems for such agents. We present work on an AGI safety layer that creates a dedicated input terminal to support the iterative improvement of an AGI agent’s utility function. The humans who switched on the agent can use this terminal to close any loopholes that are discovered in the utility function’s encoding of agent goals and constraints, to direct the agent towards new goals, or to force the agent to switch itself off.

An AGI agent may develop the emergent incentive to manipulate the above utility function improvement process, for example by deceiving, restraining, or even attacking the humans involved. The safety layer will partially, and sometimes fully, suppress this dangerous incentive.

This paper generalizes earlier work on AGI emergency stop buttons. We aim to make the mathematical methods used to construct the layer more accessible by applying them to a Markov Decision Process (MDP) model. We discuss two provable properties of the safety layer, identify still-open issues, and present ongoing work to map the layer to a Causal Influence Diagram (CID).
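The suppression of the manipulation incentive can be illustrated with a toy two-step decision problem that uses an indifference-style compensation term, in the spirit of the methods this line of work builds on. This is a minimal sketch, not the paper’s construction: all function names and numerical values below are invented for illustration.

```python
# Illustrative sketch (not the paper's exact construction): a two-step
# decision problem in which humans may replace the agent's utility
# function U0 with an improved U1 via an input terminal. The numbers
# are arbitrary and chosen only to make the incentive visible.

# If the agent lets the update happen, it afterwards acts to maximize U1,
# reaching a state worth u1_if_updated under U1 but only u0_if_updated
# under the original U0. By blocking the terminal it can secure
# u0_if_blocked under U0 instead.
u0_if_blocked = 10.0   # U0-value the agent secures by blocking the update
u0_if_updated = 4.0    # U0-value of the state reached after the update
u1_if_updated = 9.0    # U1-value of that same state

def naive_agent_choice():
    """Agent that simply maximizes U0 over the whole episode."""
    # Blocking preserves more U0-value, so the naive agent manipulates
    # the update process.
    return "block" if u0_if_blocked > u0_if_updated else "comply"

def indifference_agent_choice():
    """Agent whose utility gains a compensation term at the update event.

    When the update fires, the agent's effective utility is credited with
    the difference between the U0-value it could have secured by blocking
    and the U0-value it actually obtains, so both branches score equally.
    """
    compensation = u0_if_blocked - u0_if_updated
    value_comply = u0_if_updated + compensation   # equals u0_if_blocked
    value_block = u0_if_blocked
    # With equal values the agent has no incentive to manipulate;
    # we break the tie in favour of complying.
    return "comply" if value_comply >= value_block else "block"

print(naive_agent_choice())         # → block
print(indifference_agent_choice())  # → comply
```

The compensation term makes the agent indifferent to whether the update occurs, which removes the incentive to deceive, restrain, or attack the humans operating the terminal; it does not by itself make the agent actively assist the update, which is one reason the safety layer suppresses the dangerous incentive only partially in some settings.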

Keywords

  • AGI safety
  • Safety layer
  • Provable safety
  • Corrigibility

K. Holtman—Independent Researcher.

Acknowledgments

Thanks to Stuart Armstrong, Ryan Carey, Tom Everitt, and David Krueger for feedback on drafts of this paper, and to the anonymous reviewers for useful comments that led to improvements in the presentation.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Holtman, K. (2020). Towards AGI Agent Safety by Iteratively Improving the Utility Function. In: Goertzel, B., Panov, A., Potapov, A., Yampolskiy, R. (eds) Artificial General Intelligence. AGI 2020. Lecture Notes in Computer Science(), vol 12177. Springer, Cham. https://doi.org/10.1007/978-3-030-52152-3_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-52152-3_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-52151-6

  • Online ISBN: 978-3-030-52152-3

  • eBook Packages: Computer Science, Computer Science (R0)