A novel multi-step reinforcement learning method for solving reward hacking

  • Yinlong Yuan
  • Zhu Liang Yu (corresponding author)
  • Zhenghui Gu
  • Xiaoyan Deng
  • Yuanqing Li


Reinforcement learning with an appropriately designed reward signal can solve many sequential decision-making problems. In practice, however, reinforcement learning algorithms can fail in unexpected, counterintuitive ways. One such failure mode is reward hacking, which typically occurs when the reward function allows the agent to obtain a high return in an unintended way. Such behavior may subvert the designer's intentions and lead to accidents during training. In this paper, a new multi-step state-action value algorithm is proposed to address reward hacking. Unlike traditional algorithms, the proposed method uses a new return function that alters the discounting of future rewards and no longer treats the immediate reward as the dominant factor when selecting the current action. The performance of the proposed method is evaluated on two games, Mappy and Mountain Car. The empirical results demonstrate that the proposed method alleviates the negative impact of reward hacking and substantially improves the performance of reinforcement learning algorithms. Moreover, the results show that the proposed method can also be applied successfully to continuous state-space problems.
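The abstract only sketches the idea of a return function that de-emphasizes the immediate reward relative to future rewards. As a rough, hypothetical illustration (the paper's actual return function is not reproduced here, and the function names and weighting scheme below are assumptions), the contrast with a standard n-step return might be sketched as:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Standard n-step discounted return:
    G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_{n-1} + gamma^n * V(s_n)."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def reweighted_return(rewards, bootstrap_value, gamma=0.99):
    """Hypothetical variant in the spirit of the abstract: normalize the
    geometric weights over the n rewards so that the immediate reward no
    longer dominates the sum; the bootstrap term is kept as usual."""
    n = len(rewards)
    weights = [gamma ** k for k in range(n)]
    total = sum(weights)
    # Each reward contributes in proportion to its (normalized) weight.
    g = sum((w / total) * r for w, r in zip(weights, rewards))
    return g + gamma ** n * bootstrap_value
```

For example, with `gamma` close to 1 the reweighted variant spreads credit nearly uniformly across the n rewards, so a large immediate reward obtained in an unintended way pulls the estimated return up far less than under the standard formulation.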


Keywords: Reinforcement learning · Robotics · Reward hacking · Multi-step methods



The authors would like to thank the editors and reviewers for the time and effort spent in handling this work, and for the detailed and constructive comments provided to further improve its presentation and quality. This work was supported in part by the National Natural Science Foundation of China under Grants 61836003, 61573150, 61573152, and 61633010.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Yinlong Yuan (1)
  • Zhu Liang Yu (1, corresponding author)
  • Zhenghui Gu (1)
  • Xiaoyan Deng (1)
  • Yuanqing Li (1)

  1. College of Automation Science and Engineering, South China University of Technology, Guangzhou, China
