Abstract
This paper investigates reinforcement learning problems in which the reinforcement signal is subject to a stochastic time delay that is unknown to the learning agent. Unlike previous work in the literature, which assumes reinforcements arrive in order, this work allows the agent to receive individual reinforcements out of order. To study this setting, a stochastic time delay is introduced into a mobile robot line-following application. The main contribution is a novel stochastic approximation algorithm, an extension of Q-learning, for the time-delayed reinforcement problem. The paper includes a proof of convergence, grid-world simulation results from MATLAB, line-following simulations in the Cyberbotics Webots mobile robot simulator, and experimental results in which an e-Puck mobile robot follows a real track despite large, stochastic time delays in its reinforcement signal.
Cite this article
Campbell, J.S., Givigi, S.N. & Schwartz, H.M. Multiple Model Q-Learning for Stochastic Asynchronous Rewards. J Intell Robot Syst 81, 407–422 (2016). https://doi.org/10.1007/s10846-015-0222-2