
Behavior Adaptation by Means of Reinforcement Learning

Chapter in Marine Robot Autonomy

Abstract

Machine learning techniques can be used to solve the action-decision problem that most autonomous robots face when operating in unknown and changing environments. Reinforcement learning (RL) offers the possibility of learning a state-action policy that solves a particular task without any previous experience. A reinforcement function, designed by a human operator, is the only information required to determine, after some experimentation, how to solve the task. This chapter proposes the use of RL algorithms to learn reactive AUV behaviors, so that the state-action mapping that solves the task does not have to be defined by hand. The algorithms find the policy that optimizes the task and adapt to whatever environment dynamics are encountered. The advantage of the approach is that the same algorithms can be applied to a range of tasks, provided the problem is correctly sensed and defined. The two main methodologies applied in RL-based robot learning over the past two decades, value-function methods and policy gradient methods, are presented in this chapter and evaluated in two AUV tasks. In both cases, a well-known theoretical algorithm has been modified to fulfill the requirements of the AUV task and has been applied with a real AUV. Results show the effectiveness of both approaches, each with its own advantages and disadvantages, and motivate further investigation of these methods so that AUVs can perform more robustly and adaptively in future applications.
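To give a rough picture of the value-function side of this approach, the sketch below implements tabular Q-learning with epsilon-greedy exploration: a reward signal is the only task-specific input, and the learned Q-table yields the state-action policy. This is only a minimal illustration, not the chapter's modified algorithms for the real AUV; the environment functions (env_reset, env_step), the action set, and the reward are hypothetical placeholders to be supplied by the reader.

```python
# Minimal tabular Q-learning sketch (a value-function RL method).
# NOTE: illustrative only; env_reset/env_step and the action set are
# hypothetical placeholders, not the chapter's AUV implementation.

import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Learn Q(s, a) by temporal-difference updates.

    env_reset() -> initial state (hashable)
    env_step(state, action) -> (next_state, reward, done)
    """
    Q = defaultdict(float)

    for _ in range(episodes):
        state = env_reset()
        done = False
        while not done:
            # Epsilon-greedy exploration: mostly exploit, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env_step(state, action)

            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state = next_state

    # The greedy policy derived from Q is the learned state-action mapping.
    return {s: max(actions, key=lambda a: Q[(s, a)])
            for s in {s for (s, _) in Q}}
```

Policy gradient methods, the second family evaluated in the chapter, instead adjust the parameters of the policy directly along an estimated gradient of the expected reward, which avoids maintaining a value table or function approximator over the full state-action space.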



Acknowledgements

This research was sponsored by the Spanish government (DPI2008-06548-C03-03, DPI2011-27977-C03-02) and the PANDORA EU FP7-Project under the grant agreement No: ICT-288273.

Author information

Correspondence to Marc Carreras.


Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Carreras, M., El-fakdi, A., Ridao, P. (2013). Behavior Adaptation by Means of Reinforcement Learning. In: Seto, M. (eds) Marine Robot Autonomy. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-5659-9_7

  • DOI: https://doi.org/10.1007/978-1-4614-5659-9_7

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-5658-2

  • Online ISBN: 978-1-4614-5659-9

