Behavior Adaptation by Means of Reinforcement Learning

  • Marc Carreras
  • Andrés El-Fakdi
  • Pere Ridao


Machine learning techniques can be used to learn the action-decision problem that most autonomous robots face when working in unknown and changing environments. Reinforcement learning (RL) offers the possibility of learning a state-action policy that solves a particular task without any previous experience. A reinforcement function, designed by a human operator, is the only information required to determine, after some experimentation, how to solve the task. This chapter proposes the use of RL algorithms to learn reactive AUV behaviors, thereby avoiding the need to define by hand the state-action mapping that solves the task. The algorithms find the policy that optimizes the task and adapt to whatever environment dynamics they encounter. The advantage of the approach is that the same algorithms can be applied to a range of tasks, provided the problem is correctly sensed and defined. The two main methodologies applied in RL-based robot learning over the past two decades, value-function methods and policy gradient methods, are presented in this chapter and evaluated in two AUV tasks. In both cases, a well-known theoretical algorithm has been modified to fulfill the requirements of the AUV task and applied on a real AUV. Results show the effectiveness of both approaches, each with its own advantages and disadvantages, and point to further investigation of these methods for making AUVs perform more robustly and adaptively in future applications.
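As a concrete illustration of the value-function methodology mentioned above, the sketch below implements tabular Q-learning, the canonical value-function algorithm, on a toy one-dimensional task. The `corridor` environment, the state/action sizes, and every hyperparameter are hypothetical choices made for this example and are not taken from the chapter; the reward returned by `step` plays the role of the human-designed reinforcement function.

```python
import random

def q_learning(n_states, n_actions, step, episodes=300,
               alpha=0.1, gamma=0.95, epsilon=0.3, seed=0):
    """Tabular Q-learning: learn a state-action value table from
    interaction alone.  `step(s, a)` must return (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore occasionally, otherwise act greedily
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = step(s, a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# Hypothetical corridor task: 5 states in a row, goal at state 4,
# action 1 moves right, action 0 moves left, reward 1 on reaching the goal.
def corridor(s, a):
    s2 = min(4, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = q_learning(5, 2, corridor)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(5)]
```

After training, the greedy policy derived from the table moves right in every non-terminal state. Note that the exploration term is essential: with all-zero initial values, a purely greedy agent would never discover the reward, which is exactly the experimentation phase the abstract refers to.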


Keywords: Optimal Policy · Reinforcement Learning · Autonomous Underwater Vehicle · Learning Sample · Future Reward



This research was sponsored by the Spanish government (DPI2008-06548-C03-03, DPI2011-27977-C03-02) and by the PANDORA EU FP7 project under grant agreement no. ICT-288273.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Computer Vision and Robotics Group, University of Girona, Edifici PIV, Girona, Spain
  2. Control Engineering and Intelligent Systems Group, University of Girona, Edifici PIV, Girona, Spain
