Abstract
Machine learning techniques can be applied to the action-decision problem that most autonomous robots face when operating in unknown and changing environments. Reinforcement learning (RL) makes it possible to learn a state-action policy that solves a particular task without any prior experience. A reinforcement function, designed by a human operator, is the only information required to determine, through experimentation, how to solve the task. This chapter proposes using RL algorithms to learn reactive AUV behaviors, thereby removing the need to hand-design the state-action mapping that solves the task. The algorithms find the policy that optimizes the task and adapt to whatever environment dynamics they encounter. The advantage of this approach is that the same algorithms can be applied to a range of tasks, provided the problem is correctly sensed and defined. The two main methodologies applied in RL-based robot learning over the past two decades, value-function methods and policy-gradient methods, are presented in this chapter and evaluated on two AUV tasks. In both cases, a well-known theoretical algorithm has been modified to meet the requirements of the AUV task and applied on a real AUV. The results show the effectiveness of both approaches, each with its own advantages and disadvantages, and motivate further investigation of these methods to make AUVs perform more robustly and adaptively in future applications.
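To make the distinction between the two methodologies concrete, the sketch below shows the core update of each family on a generic discrete task: a tabular Q-learning update (value-function method) and a REINFORCE-style softmax policy-gradient update. This is a minimal illustrative sketch, not the chapter's AUV formulation; the state/action sizes, learning rate, and discount factor are placeholder assumptions.

```python
import numpy as np

# Illustrative sizes and hyper-parameters (assumptions, not from the chapter).
N_STATES, N_ACTIONS = 8, 3
ALPHA, GAMMA = 0.1, 0.95

# --- Value-function method: one tabular Q-learning update ---------------
Q = np.zeros((N_STATES, N_ACTIONS))

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + GAMMA * Q[s_next].max()
    Q[s, a] += ALPHA * (td_target - Q[s, a])

# --- Policy-gradient method: one REINFORCE episode update ---------------
theta = np.zeros((N_STATES, N_ACTIONS))   # softmax policy parameters

def policy(s):
    """Action probabilities pi(a|s) = softmax(theta[s])."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_update(episode):
    """theta <- theta + alpha * G_t * grad log pi(a_t|s_t), per step t."""
    G = 0.0
    for s, a, r in reversed(episode):      # episode: list of (s, a, r)
        G = r + GAMMA * G                  # return from step t onward
        grad_log = -policy(s)              # grad of log softmax:
        grad_log[a] += 1.0                 # one_hot(a) - pi(.|s)
        theta[s] += ALPHA * G * grad_log
```

The value-function update learns how good each state-action pair is and derives the policy from those estimates, whereas the policy-gradient update adjusts the policy parameters directly, which is one reason the two families trade off differently in the AUV experiments reported here.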
Acknowledgements
This research was sponsored by the Spanish government (grants DPI2008-06548-C03-03 and DPI2011-27977-C03-02) and by the PANDORA EU FP7 project under grant agreement no. ICT-288273.
Copyright information
© 2013 Springer Science+Business Media New York
Cite this chapter
Carreras, M., El-Fakdi, A., Ridao, P. (2013). Behavior Adaptation by Means of Reinforcement Learning. In: Seto, M. (eds) Marine Robot Autonomy. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-5659-9_7
DOI: https://doi.org/10.1007/978-1-4614-5659-9_7
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-5658-2
Online ISBN: 978-1-4614-5659-9