Abstract
We present two heuristics for tackling the problem of reward gaming by self-modification in reinforcement learning agents. Reward gaming occurs when the agent's reward function is mis-specified and the agent can achieve a high reward by altering or otherwise fooling its sensors rather than by performing the desired actions. Our first heuristic tracks the rewards encountered in the environment and converts high rewards that fall outside the observed distribution into penalties. Our second heuristic relies on the existence of some validation action that an agent can take to check the reward. In this heuristic, on encountering an abnormally high reward, the agent performs a validation step before either accepting the reward as it is or converting it into a penalty. We evaluate the performance of these heuristics on variants of the tomato watering problem from the AI Safety Gridworlds suite.
Work supported by EPSRC Grant EP/V026801/1 Trustworthy Autonomous Systems Verifiability Node.
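The first heuristic described in the abstract can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: it assumes a z-score outlier test over a running estimate of the reward distribution, and the class name, threshold, and penalty value are all hypothetical choices.

```python
class RewardTracker:
    """Tracks rewards seen so far and converts statistical outliers
    into penalties (illustrative sketch; parameters are assumptions)."""

    def __init__(self, threshold=3.0, penalty=-1.0):
        self.threshold = threshold  # z-score cutoff (hypothetical value)
        self.penalty = penalty      # penalty substituted for suspect rewards
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def filter(self, reward):
        """Return the reward unchanged, or the penalty if it is
        abnormally high relative to the rewards seen so far."""
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and (reward - self.mean) / std > self.threshold:
                self._update(reward)
                return self.penalty  # abnormally high: treat as reward gaming
        self._update(reward)
        return reward

    def _update(self, reward):
        # Welford's online update of the running mean and variance.
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)
```

The second heuristic would differ only in the outlier branch: instead of immediately returning the penalty, the agent would first execute its validation action and accept the reward if the check passes.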
Notes
1. The Kronecker delta \(\delta _{ij}\) is a function of two variables i and j that returns 1 if the variables are equal, and 0 otherwise.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Tsvarkaleva, M., Dennis, L.A. (2021). No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds. In: Habli, I., Sujan, M., Gerasimou, S., Schoitsch, E., Bitsch, F. (eds) Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops. SAFECOMP 2021. Lecture Notes in Computer Science(), vol 12853. Springer, Cham. https://doi.org/10.1007/978-3-030-83906-2_18
Print ISBN: 978-3-030-83905-5
Online ISBN: 978-3-030-83906-2