Autonomous Robots

, Volume 34, Issue 4, pp 327–346 | Cite as

DCOB: Action space for reinforcement learning of high DoF robots

  • Akihiko Yamaguchi
  • Jun Takamatsu
  • Tsukasa Ogasawara


Reinforcement learning (RL) for robot control is an important technology for future robots since it enables us to design a robot’s behavior using the reward function. However, RL for high degree-of-freedom robot control is still an open issue. This paper proposes a discrete action space DCOB which is generated from the basis functions (BFs) given to approximate a value function. The remarkable feature is that, by reducing the number of BFs to enable the robot to learn quickly the value function, the size of DCOB is also reduced, which improves the learning speed. In addition, a method WF-DCOB is proposed to enhance the performance, where wire-fitting is utilized to search for continuous actions around each discrete action of DCOB. We apply the proposed methods to motion learning tasks of a simulated humanoid robot and a real spider robot. The experimental results demonstrate outstanding performance.


Reinforcement learning Action space Motion learning Humanoid robot Crawling 



Part of this work was supported by a Grant-in-Aid for JSPS, Japan Society for the Promotion of Science, Fellows (22\(\cdot {}\)9030).

Supplementary material

10514_2013_9328_MOESM1_ESM.mpg (22.5 mb)
Supplementary material 1 (mpg 23060 KB)


  1. Asada, M., Noda, S., & Hosoda, K. (1996). Action-based sensor space categorization for robot learning. In The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS ’96) (pp. 1502–1509).Google Scholar
  2. Baird, L.C., & Klopf, A.H. (1993). Reinforcement learning with high-dimensional, continuous actions. Technical Report WL-TR-93-1147, Wright Laboratory, Wright-Patterson Air Force Base.Google Scholar
  3. Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945. doi: 10.1109/18.256500.zbMATHCrossRefMathSciNetGoogle Scholar
  4. Doya, K., Samejima, K., Katagiri, K., & Kawato, M. (2002). Multiple model-based reinforcement learning. Neural Computation, 14(6), 1347–1369. doi: 10.1162/089976602753712972.zbMATHCrossRefGoogle Scholar
  5. Gaskett, C., Fletcher, L., & Zelinsky, A. (2000). Reinforcement learning for a vision based mobile robot. In The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’00).Google Scholar
  6. Ijspeert, A., & Schaal, S. (2002). Learning attractor landscapes for learning motor primitives. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems (pp. 1547–1554). Cambridge: MIT Press.Google Scholar
  7. Kimura, H., Yamashita, T., & Kobayashi, S. (2001). Reinforcement learning of walking behavior for a four-legged robot. In Proceedings of the 40th IEEE Conference on Decision and Control. Portugal.Google Scholar
  8. Kirchner, F. (1998). Q-learning of complex behaviours on a six-legged walking machine. Robotics and Autonomous Systems, 25(3–4), 253–262. doi: 10.1016/S0921-8890(98)00054-2.CrossRefMathSciNetGoogle Scholar
  9. Kober, J., & Peters, J. (2009). Learning motor primitives for robotics. In The IEEE International Conference on Robotics and Automation (ICRA’09) (pp. 2509–2515).Google Scholar
  10. Kondo, T., & Ito, K. (2004). A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control. Robotics and Autonomous Systems, 46(2), 111–124.CrossRefGoogle Scholar
  11. Loch, J., & Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning. (pp. 323–331).Google Scholar
  12. Matsubara, T., Morimoto, J., Nakanishi, J., Hyon, S., Hale, J.G., & Cheng, G. (2007). Learning to acquire whole-body humanoid CoM movements to achieve dynamic tasks. In The IEEE International Conference on Robotics and Automation (ICRA’07). (pp. 2688–2693). doi: 10.1109/ROBOT.2007.363871.
  13. Mcgovern, A., & Barto, A.G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In The Eighteenth International Conference on Machine Learning. (pp. 361–368). San Mateo, CA: Morgan Kaufmann.Google Scholar
  14. Menache, I., Mannor, S., & Shimkin, N. (2002). Q-cut - dynamic discovery of sub-goals in reinforcement learning. In ECML ’02: Proceedings of the 13th European Conference on Machine Learning (pp. 295–306). London: Springer.Google Scholar
  15. Miyamoto, H., Morimoto, J., Doya, K., & Kawato, M. (2004). Reinforcement learning with via-point representation. Neural Networks, 17(3), 299–305. doi: 10.1016/j.neunet.2003.11.004.zbMATHCrossRefGoogle Scholar
  16. Moore, A. W., & Atkeson, C. G. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21(3), 199–233. doi: 10.1023/A:1022656217772.Google Scholar
  17. Morimoto, J., & Doya, K. (1998). Reinforcement learning of dynamic motor sequence: Learning to stand up. In The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’98). (pp 1721–1726).Google Scholar
  18. Morimoto, J., & Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36(1), 37–51. doi: 10.1016/S0921-8890(01)00113-0.zbMATHCrossRefGoogle Scholar
  19. Nakamura, Y., Mori, T., Sato, M., & Ishii, S. (2007). Reinforcement learning for a biped robot based on a CPG-actor-critic method. Neural Networks, 20(6), 723–735. doi: 10.1016/j.neunet.2007.01.002.zbMATHCrossRefGoogle Scholar
  20. Peng, J., & Williams, R. J. (1994). Incremental multi-step Q-learning. In International Conference on Machine Learning. (pp. 226–232).Google Scholar
  21. Peters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics. In IEEE-RAS International Conference on Humanoid Robots. Karlsruhe, Germany.Google Scholar
  22. Sato, M., & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2), 407–432.CrossRefGoogle Scholar
  23. Sedgewick, R., & Wayne, K. (2011). Algorithms. Boston: Addison-Wesley.Google Scholar
  24. Stolle, M. (2004). Automated discovery of options in reinforcement learning (Master’s thesis, McGill University).Google Scholar
  25. Sutton, R., & Barto, A. (1998). Reinforcement Learning: An Introduction. Cambridge: MIT Press. Retrieved from
  26. Sutton, R. S., Precup, D., & Singh, S. (1999). Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.zbMATHCrossRefMathSciNetGoogle Scholar
  27. Takahashi, Y., & Asada, M. (2003). Multi-layered learning systems for vision-based behavior acquisition of a real mobile robot. In Proceedings of SICE Annual Conference 2003 (pp. 2937–2942).Google Scholar
  28. Tham, C. K., & Prager, R. W. (1994). A modular Q-learning architecture for manipulator task decomposition. In The Eleventh International Conference on Machine Learning (pp. 309–317).Google Scholar
  29. Theodorou, E., Buchli, J., & Schaal, S. (2010). Reinforcement learning of motor skills in high dimensions: A path integral approach. In The IEEE International Conference on Robotics and Automation (ICRA’10) (pp. 2397–2403). doi: 10.1109/ROBOT.2010.5509336.
  30. Tsitsiklis, J. N., & Roy, B. V. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59–94.zbMATHGoogle Scholar
  31. Tsitsiklis, J. N., & Roy, B. V. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.Google Scholar
  32. Uchibe, E., Doya, K. (2004). Competitive-cooperative-concurrent reinforcement learning with importance sampling. In The International Conference on Simulation of Adaptive Behavior: From Animals and Animats (pp. 287–296).Google Scholar
  33. Wolpert, D. M., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11(7), 1317–1329.CrossRefGoogle Scholar
  34. Yamaguchi, A. (2011). Highly modularized learning system for behavior acquisition of functional robots. Ph.D. Thesis, Nara Institute of Science and Technology, Japan.Google Scholar
  35. Zhang, J., & Rössler, B. (2004). Self-valuing learning and generalization with application in visually guided grasping of complex objects. Robotics and Autonomous Systems, 47(2), 117–127.CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Akihiko Yamaguchi
    • 1
  • Jun Takamatsu
    • 1
  • Tsukasa Ogasawara
    • 1
  1. 1.Graduate School of Information ScienceNara Institute of Science and TechnologyNaraJapan

Personalised recommendations