
Cluster Computing, Volume 22, Supplement 1, pp 795–807

Residual Sarsa algorithm with function approximation

  • Fu Qiming
  • Hu Wen
  • Liu Quan
  • Luo Heng
  • Hu Lingyao
  • Chen Jianping (corresponding author)

Abstract

In this work, we propose an efficient algorithm, the residual Sarsa algorithm with function approximation (FARS), to improve the performance of the traditional Sarsa algorithm, using the gradient-descent method to update the function parameter vector. In the learning process, the Bellman residual method is adopted to guarantee the convergence of the algorithm, and a new rule for updating the action-value function parameter vector is adopted to address unstable and slow convergence. To accelerate the convergence rate, we introduce a new factor, the forgetting factor, which helps improve the robustness of the algorithm's performance. Experimental results on two classical reinforcement learning benchmark problems show that the FARS algorithm performs better than related reinforcement learning algorithms.
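The abstract names the main ingredients (linear function approximation, a gradient step on the Bellman residual, and a forgetting factor) without stating the update rules here. As a rough sketch of the general residual-gradient Sarsa idea only, not the authors' exact FARS update and omitting their forgetting factor, the following Python fragment assumes a user-supplied feature map phi(state, action), an environment with a hypothetical reset/step interface returning (next_state, reward, done), and placeholder hyperparameters alpha, gamma, and eps.

    import numpy as np

    def epsilon_greedy(theta, phi, state, actions, eps=0.1):
        """Pick an action epsilon-greedily w.r.t. the linear estimate q(s,a) = theta^T phi(s,a)."""
        if np.random.rand() < eps:
            return actions[np.random.randint(len(actions))]
        values = [theta @ phi(state, a) for a in actions]
        return actions[int(np.argmax(values))]

    def residual_sarsa_episode(env, phi, theta, actions,
                               alpha=0.05, gamma=0.99, eps=0.1):
        """One episode of residual-gradient Sarsa with linear function approximation.

        The parameter step descends on the squared Bellman residual, so both
        phi(s, a) and gamma * phi(s', a') appear in the update direction.
        """
        state = env.reset()
        action = epsilon_greedy(theta, phi, state, actions, eps)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                # Terminal transition: the target is just the reward.
                delta = reward - theta @ phi(state, action)
                theta += alpha * delta * phi(state, action)
                break
            next_action = epsilon_greedy(theta, phi, next_state, actions, eps)
            # Bellman residual (TD error) under the current parameters.
            delta = (reward + gamma * theta @ phi(next_state, next_action)
                     - theta @ phi(state, action))
            # Residual-gradient step on the squared Bellman residual.
            theta += alpha * delta * (phi(state, action)
                                      - gamma * phi(next_state, next_action))
            state, action = next_state, next_action
        return theta

This is a minimal illustration of the Bellman-residual approach the abstract refers to; the paper's own update rule and its forgetting-factor mechanism should be taken from the article itself.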

Keywords

Reinforcement learning · Sarsa algorithm · Function approximation · Gradient descent · Bellman residual

Acknowledgements

This research was partially supported by the National Natural Science Foundation of China (61672371, 61602334, 61502329, 61502323, 61272005, 61303108, 61373094, 61472262), the Natural Science Foundation of Jiangsu (BK20140283, BK2012616), the Natural Science Foundation of the Jiangsu Higher Education Institutions (13KJB520020), the Foundation of the Ministry of Housing and Urban-Rural Development of the People's Republic of China (2015-K1-047), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application Basic Research Program (SYG201422). We declare that there is no conflict of interest regarding the publication of this article.


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Fu Qiming (1, 2, 3, 5)
  • Hu Wen (1, 2, 3)
  • Liu Quan (4, 5, 6)
  • Luo Heng (1, 2, 3)
  • Hu Lingyao (1, 2, 3)
  • Chen Jianping (1, 2, 3, corresponding author)
  1. Institute of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
  2. Jiangsu Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, China
  3. Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, China
  4. School of Computer Science and Technology, Soochow University, Suzhou, China
  5. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
  6. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, China
