# A novel multi-step reinforcement learning method for solving *reward hacking*


## Abstract

Reinforcement learning with an appropriately designed reward signal can solve many sequential decision-making problems. In practice, however, reinforcement learning algorithms can fail in unexpected, counterintuitive ways. One such failure mode is *reward hacking*, which occurs when the reward function allows the agent to obtain a high return in an unintended way. This unintended behavior may subvert the designer's intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to address *reward hacking*. Unlike traditional algorithms, the proposed method uses a new *return* function that alters the discounting of future rewards, so that the immediate reward is no longer the dominant influence when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method alleviates the negative impact of *reward hacking* and substantially improves the performance of the reinforcement learning algorithm. Moreover, the results show that the proposed method also extends successfully to continuous state-space problems.
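The abstract does not specify the paper's modified return function, so as background only, here is a minimal sketch of the standard n-step return that multi-step state-action value methods build their targets on; the names `rewards`, `bootstrap`, and `n_step_return` are illustrative, not from the paper.

```python
def n_step_return(rewards, bootstrap, gamma=0.99):
    """Standard n-step return target:
    G = r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * Q(s_{t+n}, a_{t+n}).

    `rewards` is the list of the n observed rewards and `bootstrap` is the
    state-action value estimate at the n-th successor state.
    """
    g = bootstrap
    # Fold backwards so each reward is discounted by its distance from t.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards, no bootstrap value, gamma = 0.5
# G = 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
print(n_step_return([1.0, 0.0, 2.0], bootstrap=0.0, gamma=0.5))
```

The proposed method replaces the fixed geometric weighting `gamma**k` with a scheme that de-emphasizes the immediate reward; the sketch above shows only the conventional baseline against which that change is made.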

## Keywords

Reinforcement learning · Robotics · *Reward hacking* · Multi-step methods

## Notes

### Acknowledgment

The authors would like to thank the editors and reviewers for the time and effort spent in handling this work, and for the detailed and constructive comments provided to further improve its presentation and quality. This work was supported in part by the National Natural Science Foundation of China under Grants 61836003, 61573150, 61573152, and 61633010.
