DCOB: Action space for reinforcement learning of high DoF robots


Reinforcement learning (RL) for robot control is an important technology for future robots since it enables us to design a robot’s behavior using the reward function. However, RL for high degree-of-freedom (DoF) robot control is still an open issue. This paper proposes a discrete action space, DCOB, which is generated from the basis functions (BFs) given to approximate a value function. Its notable feature is that reducing the number of BFs, which lets the robot learn the value function quickly, also reduces the size of DCOB, which further improves the learning speed. In addition, a method called WF-DCOB is proposed to enhance performance, in which wire-fitting is used to search for continuous actions around each discrete action of DCOB. We apply the proposed methods to motion learning tasks of a simulated humanoid robot and a real spider robot. The experimental results demonstrate outstanding performance.




  1. For a vector \(\mathbf{x}=(x_1,\dots {},x_D)\), the maximum norm is defined as \(\Vert \mathbf{x}\Vert _\infty = \max _m{|x_m|}\).

  2. We do not abbreviate the trajectory by observing the output of the BFs because, when the dynamics is a POMDP, using the BF outputs to terminate the action may complicate the dynamics further.

  3. Actually, unit division and unit deletion are implemented.

  4. Open Dynamics Engine: http://www.ode.org/

  5. We start the EM algorithm with \(200\) BFs, and obtain the \(202\) trained BFs.

  6. The term, \(\dot{c}_{0x}(t) e_{\mathrm{{z}}1}(t) + \dot{c}_{0y}(t) e_{\mathrm{{z}}2}(t)\), indicates the velocity of the body link projected into the \((e_{\mathrm{{z}}1},e_{\mathrm{{z}}2},0)\) direction; that is, the \(x\)-\(y\) direction from the body link to the head link.

  7. A laptop PC: Pentium M \(2 \text{ GHz }\) CPU, \(512 \text{ MB }\) RAM, Debian Linux.

  8. We assume that a simple PD-controller is used as the low-level controller.

  9. \(\varvec{\varSigma }_k^\mathcal Q \) is calculated from the original covariance matrix \(\varvec{\varSigma }_k\) (on the \(\mathcal X \) space) as follows. For ease of calculation, let \(\mathbf{C}_{\mathrm{{P}}}(\mathbf{x})=\hat{\text{ C }}_\mathrm{{p}}\mathbf{x}\) where \(\hat{\text{ C }}_\mathrm{{p}}\) is a constant matrix. The converted covariance matrix is \(\varvec{\varSigma }_k^\mathcal Q = \hat{\text{ C }}_\mathrm{{p}} \varvec{\varSigma }_k \hat{\text{ C }}_\mathrm{{p}}^\top \).




Part of this work was supported by a Grant-in-Aid for JSPS (Japan Society for the Promotion of Science) Fellows (22\(\cdot {}\)9030).

Author information



Corresponding author

Correspondence to Akihiko Yamaguchi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mpg 23060 KB)



Appendix A Wire-fitting

For a continuous state \(\mathbf{x}\in \mathcal X \) and a continuous action \(\mathbf{u}\in \mathcal U \), wire-fitting is defined as:

$$\begin{aligned} Q(\mathbf{x},\mathbf{u})&= \lim _{\epsilon \rightarrow 0^+} \frac{\sum _{i\in \mathcal W } (d_i+\epsilon )^{-1}q_i(\mathbf{x})}{\sum _{i\in \mathcal W } (d_i+\epsilon )^{-1}} , \end{aligned}$$
$$\begin{aligned} d_i&= \Vert \mathbf{u}-\mathbf{u}_i(\mathbf{x}) \Vert ^2 + C\bigl [\max _{i^{\prime }\in \mathcal W } (q_{i^{\prime }}(\mathbf{x})) - q_i(\mathbf{x})\bigr ]. \end{aligned}$$

Here, a pair of functions \(q_i(\mathbf{x}):\mathcal X \rightarrow \mathbb R \) and \(\mathbf{u}_i(\mathbf{x}):\mathcal X \rightarrow \mathcal U \) (\(i\in \mathcal W \)) is called a control wire, and wire-fitting can be regarded as an interpolator of the set of control wires \(\mathcal W \). \(C\) is the smoothing factor of the interpolation; we choose \(C=0.001\) in the experiments. Any function approximator can be used for \(q_i(\mathbf{x})\) and \(\mathbf{u}_i(\mathbf{x})\). Regardless of the function approximator chosen, one of \(q_i(\mathbf{x})\), \(i\in \mathcal W \), is equal to \(\max _{\mathbf{u}}{Q(\mathbf{x},\mathbf{u})}\), and the corresponding \(\mathbf{u}_i(\mathbf{x})\) is the greedy action at \(\mathbf{x}\):

$$\begin{aligned}&\max _{\mathbf{u}}{Q(\mathbf{x},\mathbf{u})} = \max _{i\in \mathcal W } (q_i(\mathbf{x})), \end{aligned}$$
$$\begin{aligned}&\arg \,\max _{\mathbf{u}}{Q(\mathbf{x},\mathbf{u})} = \mathbf{u}_i(\mathbf{x})\Big |_{i=\arg \,\max _{i^{\prime }\in \mathcal W }(q_{i^{\prime }}(\mathbf{x}))}. \end{aligned}$$

Namely, the greedy action at state \(\mathbf{x}\) is calculated only by evaluating \(q_{i}(\mathbf{x})\) for \(i\in \mathcal W \).
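The interpolation and the greedy-action property above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names `wire_fitting_q` and `greedy_action` are hypothetical, the wire outputs \(q_i(\mathbf{x})\) and \(\mathbf{u}_i(\mathbf{x})\) are assumed to be already evaluated at a fixed state, and the limit \(\epsilon \rightarrow 0^+\) is approximated by a small positive constant.

```python
import numpy as np


def wire_fitting_q(u, wire_actions, wire_values, c=0.001, eps=1e-9):
    """Interpolate Q(x, u) from control wires evaluated at one state x.

    wire_actions: (W, dim_u) array holding u_i(x) for each wire i.
    wire_values:  (W,) array holding q_i(x) for each wire i.
    eps: small constant standing in for the limit eps -> 0+ (assumption).
    """
    wire_actions = np.asarray(wire_actions, dtype=float)
    wire_values = np.asarray(wire_values, dtype=float)
    u = np.asarray(u, dtype=float)
    # d_i = ||u - u_i(x)||^2 + C * (max_i' q_i'(x) - q_i(x))
    dist = np.sum((u - wire_actions) ** 2, axis=1)
    dist = dist + c * (wire_values.max() - wire_values)
    weights = 1.0 / (dist + eps)
    return float(np.dot(weights, wire_values) / np.sum(weights))


def greedy_action(wire_actions, wire_values):
    """Return (u_i(x), q_i(x)) of the wire with the largest q_i(x)."""
    wire_actions = np.asarray(wire_actions, dtype=float)
    wire_values = np.asarray(wire_values, dtype=float)
    best = int(np.argmax(wire_values))
    return wire_actions[best], float(wire_values[best])
```

Because \(Q\) is a positively weighted average of the \(q_i(\mathbf{x})\), it cannot exceed the largest of them, and it approaches that value at the corresponding \(\mathbf{u}_i(\mathbf{x})\); this is why only the wire values need to be compared to find the greedy action.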

We use an NGnet for \(q_{i}(\mathbf{x})\) and a constant vector for \(\mathbf{u}_{i}(\mathbf{x})\); that is, we let \(q_i(\mathbf{x})= {\mathbf{\theta }}_i^\top {\mathbf{\phi }}(\mathbf{x})\) and \(\mathbf{u}_i(\mathbf{x})= \mathbf{U}_i\), where \({\mathbf{\phi }}(\mathbf{x})\) is the output vector of the NGnet. The parameter vector \({\mathbf{\theta }}\) is defined as \({\mathbf{\theta }}^\top = ({\mathbf{\theta }}_1^\top , \mathbf{U}_1^\top , {\mathbf{\theta }}_2^\top , \mathbf{U}_2^\top , \dots {}, {\mathbf{\theta }}_{|\mathcal W |}^\top , \mathbf{U}_{|\mathcal W |}^\top ) \), and the gradient \(\mathbf{\nabla }_{\mathbf{\theta }} Q(\mathbf{x},\mathbf{u})\) can be calculated analytically.

Figure 18 shows an example of wire-fitting where both \(\mathbf{x}\in [-1,1]\) and \(\mathbf{u}\in [-1,1]\) are one-dimensional. There are two control wires (dashed lines) and three basis functions (dotted lines). The BFs (NGnet) are located at \(\mathbf{x}=(-1),(0),(1)\), respectively, and the parameters of the wire-fitting are \({\mathbf{\theta }}_1=(0.0, 0.6, 0.0)^\top , \mathbf{U}_1=(-0.5), {\mathbf{\theta }}_2=(0.0, 0.3, 0.6)^\top , \mathbf{U}_2=(0.5)\). The control wires are plotted as \((\mathbf{x}, \mathbf{u}_{1}(\mathbf{x}), q_{1}(\mathbf{x}))\) and \((\mathbf{x}, \mathbf{u}_{2}(\mathbf{x}), q_{2}(\mathbf{x}))\), respectively. Each \(\times \)-mark is placed at \((\mathbf{x}, \mathbf{u}_{i^\star }(\mathbf{x}), q_{i^\star }(\mathbf{x}))\big |_{i^\star =\arg \,\max _{i}q_{i}(\mathbf{x})}\), which shows the greedy action at \(\mathbf{x}\).

Fig. 18 Example of wire-fitting

Appendix B Calculations of BFTrans

Generating trajectory

The reference trajectory \(\mathbf{q}^\mathrm{{D}}(t_n+t_a),\>{}{} t_a\in [0,T_{\mathrm{F}}]\) is designed so that the state changes from the starting state \(\mathbf{x}_n=\mathbf{x}(t_n)\) to the target \(\mathbf{q}^\mathrm{trg}\) in the time interval \(T_{\mathrm{F}}\). We represent the trajectory with a cubic function,

$$\begin{aligned} {\mathbf{q}}^{\mathrm{D}}(t_n+t_a)= \mathbf{c}_0+ \mathbf{c}_1 t_a+ \mathbf{c}_2 t_a^2+ \mathbf{c}_3 t_a^3, \end{aligned}$$

where \(\mathbf{c}_{0,\dots {},3}\) are the coefficient vectors. These coefficients are determined by the boundary conditions,

$$\begin{aligned}&\mathbf{q}^\mathrm{{D}}(t_n)=\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n) ,\quad \mathbf{q}^\mathrm{{D}}(t_n+T_{\mathrm{F}})=\mathbf{q}^\mathrm{trg}, \nonumber \\&\dot{\mathbf{q}}^\mathrm{{D}}(t_n+T_{\mathrm{F}})=\mathbf{0},\quad {\ddot{\mathbf{q}}}^\mathrm{{D}}(t_n+T_{\mathrm{F}})=\mathbf{0}, \end{aligned}$$

where \(\mathbf{0}\) denotes a zero vector.
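Solving the four boundary conditions for the cubic gives \(\mathbf{c}_0=\mathbf{C}_{\mathrm{P}}(\mathbf{x}_n)\), \(\mathbf{c}_1=3\Delta /T_{\mathrm{F}}\), \(\mathbf{c}_2=-3\Delta /T_{\mathrm{F}}^2\), and \(\mathbf{c}_3=\Delta /T_{\mathrm{F}}^3\), with \(\Delta =\mathbf{q}^\mathrm{trg}-\mathbf{C}_{\mathrm{P}}(\mathbf{x}_n)\). The following sketch computes these coefficients; the function names are hypothetical and the starting point is assumed to be passed in already converted by \(\mathbf{C}_{\mathrm{P}}\).

```python
import numpy as np


def cubic_coefficients(q_start, q_target, t_f):
    """Coefficients c0..c3 of the cubic reference trajectory.

    q_start:  C_P(x_n), the starting point (assumed precomputed).
    q_target: q^trg, the target point.
    t_f:      T_F, the full time interval (> 0).
    """
    q_start = np.asarray(q_start, dtype=float)
    delta = np.asarray(q_target, dtype=float) - q_start
    c0 = q_start
    c1 = 3.0 * delta / t_f
    c2 = -3.0 * delta / t_f**2
    c3 = delta / t_f**3
    return c0, c1, c2, c3


def reference_position(coeffs, t_a):
    """Evaluate q^D(t_n + t_a) for t_a in [0, T_F]."""
    c0, c1, c2, c3 = coeffs
    return c0 + c1 * t_a + c2 * t_a**2 + c3 * t_a**3
```

With these coefficients the trajectory starts at the current point and arrives at the target with zero velocity and zero acceleration, which matches the boundary conditions stated above.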

Abbreviating trajectory

The abbreviation is performed as follows: (1) estimate \(D_\mathrm{{N}}(\mathbf{x}_n)\) as the distance between two neighboring BFs around the start state \(\mathbf{x}_n\), (2) calculate \(T_{\mathrm{N}}\) from the ratio of \(D_\mathrm{{N}}(\mathbf{x}_n)\) to the distance between \(\mathbf{x}_n\) and \(\mathbf{q}^\mathrm{trg}\).

To define \(D_\mathrm{{N}}(\mathbf{x}_n)\), for each BF \(k\), we first calculate \(d_\mathrm{{N}}(k)\) as the distance between its center \({\mathbf{\mu }}_{k}\) and the center of the BF nearest to \(k\). Then, we estimate \(D_\mathrm{{N}}(\mathbf{x}_n)\) by interpolating \(\{d_\mathrm{{N}}(k)|k\in \mathcal{K }\}\) with the output of the BFs at \(\mathbf{x}_n\).

\(d_\mathrm{{N}}(k)\) is calculated by

$$\begin{aligned} k_\mathrm{{N}}(k)&= \text{ arg } \text{ min }_{k^{\prime }\in \mathcal{K }, k^{\prime }\ne k} \Vert \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k^{\prime }}) - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\Vert _\infty , \end{aligned}$$
$$\begin{aligned} d_\mathrm{{N}}(k)&= \max {}\bigl ( \Vert \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_\mathrm{{N}}(k)}) - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\Vert _\infty ,\>{}\text{ d }_{\min {}k} \bigr ), \end{aligned}$$

where \(\text{ d }_{\min {}k}\in \mathbb R \) is a positive constant to adjust \(d_\mathrm{{N}}(k)\) when \(\Vert \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_\mathrm{{N}}(k)}) - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\Vert _\infty \) is too small. For NGnet, we define it as \(\text{ d }_{\min {}k}= \sqrt{\lambda _{k}^\mathcal Q }\) where \(\lambda _{k}^\mathcal Q \) is the maximum eigenvalue of the covariance matrix \(\varvec{\varSigma }_k^\mathcal Q \) on the \(\mathcal Q \) space (see footnote 9). Note that we can pre-compute \(\{d_\mathrm{{N}}(k)|k\in \mathcal{K }\}\) for fixed BFs.
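The pre-computation of \(d_\mathrm{N}(k)\) can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, \(\mathbf{C}_{\mathrm{P}}\) is assumed to be linear with a constant matrix (as in the footnote on the covariance conversion), and at least two BFs are assumed to exist.

```python
import numpy as np


def convert_covariance(c_hat, sigma):
    """Sigma^Q = C_hat Sigma C_hat^T, assuming C_P(x) = C_hat x."""
    c_hat = np.asarray(c_hat, dtype=float)
    return c_hat @ np.asarray(sigma, dtype=float) @ c_hat.T


def nearest_bf_distances(centers_q, covariances_q):
    """Pre-compute d_N(k) for every BF k (requires at least two BFs).

    centers_q:     (K, D) array of C_P(mu_k).
    covariances_q: (K, D, D) array of Sigma_k^Q.
    """
    centers_q = np.asarray(centers_q, dtype=float)
    covariances_q = np.asarray(covariances_q, dtype=float)
    # Pairwise maximum-norm distances between BF centers.
    pairwise = np.abs(centers_q[:, None, :] - centers_q[None, :, :]).max(axis=2)
    np.fill_diagonal(pairwise, np.inf)  # exclude k' = k from the arg min
    nearest = pairwise.min(axis=1)
    # d_min_k = sqrt(largest eigenvalue of Sigma_k^Q); eigvalsh sorts ascending.
    d_min = np.sqrt(np.linalg.eigvalsh(covariances_q)[:, -1])
    return np.maximum(nearest, d_min)
```

The lower bound \(\text{d}_{\min k}\) only takes effect when a BF's nearest neighbor is closer than the BF's own spread, so widely separated BFs keep their plain maximum-norm distance.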

Using the output of BFs \({\mathbf{\phi }}(\mathbf{x}_n)\), \(D_\mathrm{{N}}(\mathbf{x}_n)\) is estimated by

$$\begin{aligned} D_\mathrm{{N}}(\mathbf{x}_n) = (d_\mathrm{{N}}(1),\>{}d_\mathrm{{N}}(2),\ldots ,\>{}d_\mathrm{{N}}(|\mathcal{K }|))^\top {\mathbf{\phi }}(\mathbf{x}_n) \end{aligned}$$

Finally, \(T_{\mathrm{N}}\) is defined by

$$\begin{aligned} T_{\mathrm{N}}(\mathbf{x}_n,\mathbf{u}_n) = \min \Bigl (1,\frac{\text{ F }_\mathrm{abbrv} D_\mathrm{{N}}(\mathbf{x}_n)}{\Vert \mathbf{q}^\mathrm{trg}-\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n)\Vert _\infty }\Bigr ) T_{\mathrm{F}}. \end{aligned}$$
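As a rough numerical illustration of the last two equations, the sketch below interpolates \(D_\mathrm{N}(\mathbf{x}_n)\) from pre-computed \(d_\mathrm{N}(k)\) values and derives \(T_{\mathrm{N}}\). The function name and argument layout are hypothetical, the BF outputs \({\mathbf{\phi }}(\mathbf{x}_n)\) are assumed to be given, and returning \(T_{\mathrm{F}}\) when the start already coincides with the target is an assumption made here to avoid a division by zero.

```python
import numpy as np


def abbreviated_interval(phi, d_n, q_start, q_target, t_f, f_abbrv):
    """Compute T_N = min(1, F_abbrv * D_N / ||q_trg - C_P(x_n)||_inf) * T_F.

    phi:      (K,) BF outputs phi(x_n).
    d_n:      (K,) pre-computed d_N(k).
    q_start:  C_P(x_n) (assumed precomputed).
    q_target: q^trg.
    """
    phi = np.asarray(phi, dtype=float)
    d_n = np.asarray(d_n, dtype=float)
    # D_N(x_n): interpolation of d_N(k) by the BF outputs.
    big_d_n = float(np.dot(d_n, phi))
    gap = np.asarray(q_target, dtype=float) - np.asarray(q_start, dtype=float)
    dist = float(np.max(np.abs(gap)))  # maximum norm
    if dist == 0.0:
        # Assumption: the ratio is unbounded here, so min(1, .) gives 1.
        return t_f
    return min(1.0, f_abbrv * big_d_n / dist) * t_f
```

For example, with \(D_\mathrm{N}=1\), a target at maximum-norm distance 4, \(\text{F}_\mathrm{abbrv}=1\), and \(T_{\mathrm{F}}=2\), the action is cut to \(T_{\mathrm{N}}=0.5\), whereas a target closer than \(D_\mathrm{N}\) keeps the full interval.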

Appendix C Initialization and constraints of WF-DCOB

Initializing wire-fitting parameters

For a control wire \(i\in \mathcal W \), we use \(a_{i}^{\mathrm{dcob}}\) to denote the corresponding action in DCOB: \(a_{i}^{\mathrm{dcob}} = (g_{i}^{\mathrm{dcob}}, k_{i}^{\mathrm{dcob}})\). Let \((\text{ g }^\mathrm{{S}}_i, \text{ g }^\mathrm{{E}}_i)\) denote the range of the interval factor which includes \(g_{i}^{\mathrm{dcob}}\). For each control wire \(i\in \mathcal W \), its parameter is defined as \(\mathbf{U}_i=(g_i,\mathbf{q}^\mathrm{trg}_i)\) and is initialized by

$$\begin{aligned}&g_i \leftarrow \frac{\text{ g }^\mathrm{{S}}_i + \text{ g }^\mathrm{{E}}_i}{2}, \end{aligned}$$
$$\begin{aligned}&\mathbf{q}^\mathrm{trg}_i \leftarrow \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}}). \end{aligned}$$

The other parameters of the control wires \(\{{\mathbf{\theta }}_i | i\in \mathcal W \}\) are initialized to zero, since, in a learning-from-scratch case, we do not have prior knowledge of the action values.
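A minimal sketch of this initialization for a single control wire is given below. The function name and the tuple-based interface are hypothetical; the BF center is assumed to be passed in already converted by \(\mathbf{C}_{\mathrm{P}}\).

```python
import numpy as np


def init_control_wire(g_range, center_q, n_basis):
    """Initialize one control wire from its corresponding DCOB action.

    g_range:  (g_S, g_E), the interval-factor range containing g_dcob.
    center_q: C_P(mu_k) for the BF k_dcob of the DCOB action.
    n_basis:  number of BFs, i.e. the length of theta_i.
    """
    g_start, g_end = g_range
    g = 0.5 * (g_start + g_end)               # midpoint of the range
    q_trg = np.array(center_q, dtype=float)   # copy of the BF center
    theta = np.zeros(n_basis)                 # no prior on action values
    return g, q_trg, theta
```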

Constraints on wire-fitting parameters

For \(\mathbf{U}_i=(g_i,\mathbf{q}^\mathrm{trg}_i)\), the interval factor \(g_i\) is constrained to lie inside \((\text{ g }^\mathrm{{S}}_i, \text{ g }^\mathrm{{E}}_i)\), and the target point \(\mathbf{q}^\mathrm{trg}_i\) is constrained to lie inside a hypersphere of radius \(d_\mathrm{{N}}(k_{i}^{\mathrm{dcob}})\) centered at \(\mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}})\). Here, \(d_\mathrm{{N}}(k_{i}^{\mathrm{dcob}})\) denotes the distance from BF \(k_{i}^{\mathrm{dcob}}\) to its nearest BF, as defined by Eq. 42. Specifically, the parameter \(\mathbf{U}_i=(g_i,\mathbf{q}^\mathrm{trg}_i)\) of each control wire \(i\in \mathcal W \) is constrained by



$$\begin{aligned} \mathbf{diff} \triangleq \mathbf{q}^\mathrm{trg}_i - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}}). \end{aligned}$$

These constraints are applied after each update of an RL algorithm.
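One way such constraints could be enforced after an update is sketched below. This is an assumption-laden illustration, not the authors' exact rule: the function name is hypothetical, the interval factor is clipped to the closed range, and the target point is projected back onto the hypersphere using the Euclidean norm of \(\mathbf{diff}\).

```python
import numpy as np


def constrain_control_wire(g, q_trg, g_range, center_q, radius):
    """Project (g_i, q_trg_i) back into the allowed region.

    g_range:  (g_S, g_E) bounds of the interval factor.
    center_q: C_P(mu_k) for the BF k_dcob of the wire.
    radius:   d_N(k_dcob), radius of the allowed hypersphere.
    Assumptions: closed-interval clipping and Euclidean projection.
    """
    g_start, g_end = g_range
    g = min(max(g, g_start), g_end)
    center_q = np.asarray(center_q, dtype=float)
    diff = np.asarray(q_trg, dtype=float) - center_q
    norm = float(np.linalg.norm(diff))
    if norm > radius:
        # Shrink diff so the target lies on the sphere surface.
        diff = diff * (radius / norm)
    return g, center_q + diff
```

Projecting after each update keeps every control wire close to the discrete DCOB action it was initialized from, so the continuous search stays local to that action.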


About this article

Cite this article

Yamaguchi, A., Takamatsu, J. & Ogasawara, T. DCOB: Action space for reinforcement learning of high DoF robots. Auton Robot 34, 327–346 (2013). https://doi.org/10.1007/s10514-013-9328-1



  • Reinforcement learning
  • Action space
  • Motion learning
  • Humanoid robot
  • Crawling