Learning motions from demonstrations and rewards with time-invariant dynamical systems based policies

Autonomous Robots

Abstract

An important challenge when using reinforcement learning to learn motions in robotics is the choice of parameterization for the policy. Following the learning from demonstration paradigm, we use Gaussian Mixture Regression to extract a parameterization with relevant non-linear features from a set of demonstrations of a motion. The resulting parameterization takes the form of a non-linear time-invariant dynamical system (DS). We use this time-invariant DS as a parameterized policy for a variant of the PI2 policy search algorithm. This paper contributes by adapting PI2 to this time-invariant motion representation. We introduce two novel parameter exploration schemes that can be used to (1) sample model parameters so as to achieve uniform exploration in state space and (2) explore while ensuring stability of the resulting motion model. Additionally, a state-dependent stiffness profile is learned simultaneously with the reference trajectory, and both are used together in a variable impedance control architecture. This learning architecture is validated in a hardware experiment consisting of a digging task on a KUKA LWR platform.

Notes

  1. For the 3-dimensional case, the norm follows a Maxwell–Boltzmann distribution with mean \(\mu _v = 2\tilde{\sigma }\sqrt{\frac{2}{\pi }}\).

  2. Note that the divergent force field is a perturbation that has nothing to do with the exploration noise. The divergent force field is not stochastic, and it is part of the system.

  3. This method can improve the robustness by preventing problematic situations where none of the roll-outs yield good performance.

  4. Although for our parameterization the size of the parameters is only loosely related to the actual control that will be applied to the system, we keep the conventional name “control cost” for this term.

References

  • Ajoudani, A., Tsagarakis, N., & Bicchi, A. (2012). Tele-impedance: Teleoperation with impedance regulation using a body-machine interface. The International Journal of Robotics Research, 31(13), 1642–1656.

  • Billard, A., Calinon, S., Dillmann, R., & Schaal, S. (2008). Robot programming by demonstration. In Springer Handbook of Robotics (Chap. 59). Springer.

  • Buchli, J., Stulp, F., Theodorou, E., & Schaal, S. (2011). Learning variable impedance control. The International Journal of Robotics Research, 30(7), 820–833.

  • Burdet, E., Osu, R., Franklin, D. W., Milner, T. E., & Kawato, M. (2001). The central nervous system stabilizes unstable dynamics by learning optimal impedance. Nature, 414(6862), 446–449.

  • Calinon, S., Bruno, S., & Caldwell, D.G. (2014). A task-parameterized probabilistic model with minimal intervention control. In IEEE International Conference on Robotics and Automation (pp. 3339–3344).

  • Calinon, S., Sardellitti, I., Caldwell, D. (2010). Learning-based control strategy for safe human-robot interaction exploiting task and robot redundancies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 249–254).

  • Calinon, S., D’halluin, F., Sauser, E. L., Caldwell, D. G., & Billard, A. G. (2010). Learning and reproduction of gestures by imitation. IEEE Robotics & Automation Magazine, 17(2), 44–54.

  • Calinon, S., Kormushev, P., & Caldwell, D. G. (2013). Compliant skills acquisition and multi-optima policy search with EM-based reinforcement learning. Robotics and Autonomous Systems, 61(4), 369–379.

  • Daniel, C., Neumann, G., & Peters, J. (2012). Learning concurrent motor skills in versatile solution spaces. In Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on 2012. IEEE, (pp. 3591–3597).

  • Farshidian, F., Neunert, M., & Buchli, J. (2014). Learning of closed-loop motion control. In IEEE International Conference on Intelligent Robots and Systems, no. IROS (pp. 1441–1446).

  • Garabini, M., Passaglia, A., Belo, F., Salaris, P., & Bicchi, A. (2012). Optimality principles in stiffness control: The VSA kick. In IEEE International Conference on Robotics and Automation (pp. 3341–3346).

  • Gribovskaya, E., Khansari-Zadeh, S. M., & Billard, A. (2010). Learning non-linear multivariate dynamics of motion in robotic manipulators. The International Journal of Robotics Research, 30(1), 80–117.

  • Guenter, F., Hersch, M., Calinon, S., & Billard, A. (2007). Reinforcement learning for imitating constrained reaching movements. Advanced Robotics, 21(13), 1521–1544.

  • Gullapalli, V., Franklin, J. A., & Benbrahim, H. (1994). Acquiring robot skills via reinforcement learning. IEEE Control Systems, 14(1), 13–24.

  • Hogan, N. (1985). Impedance control: An approach to manipulation. Journal of Dynamic Systems, Measurement, and Control, 107(1), 1–24.

  • Howard, M., Braun, D. J., & Vijayakumar, S. (2013). Transferring human impedance behavior to heterogeneous variable impedance actuators. IEEE Transactions on Robotics, 29(4), 847–862.

  • Ijspeert, A.J., Nakanishi, J., & Schaal, S. (2002). Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation, IEEE (Vol. 2, pp. 1398–1403).

  • Khansari-Zadeh, S. M., & Billard, A. (2011). Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 27(5), 943–957.

  • Kober, J., & Peters, J. (2009). Learning motor primitives for robotics. In IEEE International Conference on Robotics and Automation, 2009, ICRA’09, IEEE (pp. 2112–2118).

  • Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32, 1238–1274.

  • Kober, J., & Peters, J. (2010). Policy search for motor primitives in robotics. Machine Learning, 84(1–2), 171–203.

  • Kober, J., & Peters, J. (2010). Imitation and reinforcement learning. IEEE Robotics Automation Magazine, 17(2), 55–62.

  • Kronander, K., Khansari-Zadeh, S. M., & Billard, A. (2015). Incremental motion learning with locally modulated dynamical systems. Robotics and Autonomous Systems, 70, 52–62.

  • Kronander, K., & Billard, A. (2013). Learning compliant manipulation through kinesthetic and tactile human-robot interaction. Transactions on Haptics, 7(3), 1–16.

  • Kronander, K., & Billard, A. (2016). Passive interaction control with dynamical systems. Robotics and Automation Letters, 1(1), 106–113.

  • Lee, A. X., Lu, H., Gupta, A., Levine, S., & Abbeel, P. (2015). Learning force-based manipulation of deformable objects from multiple demonstrations. In IEEE International Conference on Robotics and Automation.

  • Lemme, A., Neumann, K., Reinhart, R., & Steil, J. (2014). Neural learning of vector fields for encoding stable dynamical systems. Neurocomputing, 141, 3–14.

  • Medina, J., Sieber, D., & Hirche, S. (2013). Risk-sensitive interaction control in uncertain manipulation tasks. In IEEE International Conference on Robotics and Automation.

  • Mitrovic, D., Klanke, S., & Vijayakumar, S. (2011). Learning impedance control of antagonistic systems based on stochastic optimization principles. The International Journal of Robotics Research, 30(5), 556–573.

  • Paraschos, A., Daniel, C., Peters, J., & Neumann, G. (2013). Probabilistic movement primitives. Neural Information Processing Systems (pp. 1–9).

  • Pastor, P., Righetti, L., Kalakrishnan, M., & Schaal, S. (2011). Online movement adaptation based on previous sensor experiences. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011. IEEE (pp. 365–371).

  • Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7–9), 1180–1190.

  • Rozo, L., Calinon, S., Caldwell, D., Jiménez, P., & Torras, C. (2013). Learning collaborative impedance-based robot behaviors. In AAAI Conference on Artificial Intelligence.

  • Rückert, E. A., Neumann, G., Toussaint, M., & Maass, W. (2013). Learned graphical models for probabilistic planning provide a new class of movement primitives. Frontiers in Computational Neuroscience, 6(January), 1–20.

  • Schaal, S., Ijspeert, A., & Billard, A. (2003). Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 358(1431), 537–547.

  • Selen, L. P. J., Franklin, D. W., & Wolpert, D. M. (2009). Impedance control reduces instability that arises from motor noise. The Journal of Neuroscience, 29(40), 12606–12616.

  • Stulp, F., & Sigaud, O. (2012). Policy improvement methods: Between black-box optimization and episodic reinforcement learning.

  • Stulp, F., Sigaud, O. (2012). Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 281–288).

  • Sung, H.G. (2004). Gaussian mixture regression and classification (Ph.D. dissertation, Rice University).

  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

  • Tedrake, R., Zhang, T. W., & Seung, H.S. (2004). Stochastic policy gradient reinforcement learning on a simple 3d biped. In Intelligent Robots and Systems, 2004. (IROS 2004). Proceedings of IEEE/RSJ International Conference on 2004. IEEE (Vol. 3, pp. 2849–2854).

  • Tee, K. P., Franklin, D. W., Kawato, M., Milner, T. E., Burdet, E., Peng, K., et al. (2010). Concurrent adaptation of force and impedance in the redundant muscle system. Biological Cybernetics, 102(1), 31–44.

  • Theodorou, E., Buchli, J., & Schaal, S. (2010). A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11, 3137–3181.

  • Thijssen, S., & Kappen, H. (2015). Path integral control and state-dependent feedback. Physical Review E, 91(3), 032104.

  • Toussaint, M. (2009). Probabilistic inference as a model of planned behavior. Künstliche Intelligenz, 3(9), 23–29.

  • Vlassis, N., Toussaint, M., Kontes, G., & Piperidis, S. (2009). Learning model-free robot control by a monte carlo EM algorithm. Autonomous Robots, 27(2), 123–130.

  • Yang, C., Ganesh, G., Haddadin, S., Parusel, S., Albu-Schaffer, A., & Burdet, E. (2011). Human-like adaptation of force and impedance in stable and unstable interactions. IEEE Transactions on Robotics, 27(5), 918–930.

Acknowledgements

This research was funded by the European Union Seventh Framework Programme FP7/2007-2013 under Grant Agreement No. 288533 ROBOHOW.COG and by the Swiss National Science Foundation through the National Center of Competence in Research Robotics.

Author information

Corresponding author

Correspondence to Klas Kronander.

Appendices

Appendix 1: Additional details for Sect. 3

1.1 Details of the first simulation experiment

The first simulation experiment is performed with \(N_r = 10\) and 50 learning iterations. The variance of the exploration noise \({\varvec{{\varSigma }}}_{\epsilon }\) is set to obtain a mean norm of 0.1 for the velocity noise vector (see Sect. 4.1).
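The relation in note 1 can be used to set \({\varvec{{\varSigma }}}_{\epsilon }\) from a target mean noise norm. Below is a minimal numerical sketch of that inversion, assuming isotropic noise with per-axis standard deviation \(\tilde{\sigma }\); the function names and the Monte-Carlo check are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def sigma_for_mean_norm(target_mean_norm, dim=3):
    """Per-axis standard deviation of isotropic Gaussian noise such that the
    expected norm of the noise vector equals target_mean_norm.

    For dim == 3 the norm is Maxwell-Boltzmann distributed with mean
    2 * sigma * sqrt(2 / pi), as stated in note 1; for dim == 2 the norm is
    Rayleigh distributed with mean sigma * sqrt(pi / 2).
    """
    if dim == 3:
        return target_mean_norm / (2.0 * np.sqrt(2.0 / np.pi))
    if dim == 2:
        return target_mean_norm / np.sqrt(np.pi / 2.0)
    raise ValueError("only 2-D and 3-D cases are handled in this sketch")

# Monte-Carlo check: sample velocity noise and verify the mean norm.
rng = np.random.default_rng(0)
sigma = sigma_for_mean_norm(0.1, dim=3)           # target mean norm of 0.1
noise = rng.normal(scale=sigma, size=(100000, 3))
print(np.linalg.norm(noise, axis=1).mean())       # close to 0.1
```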

The cost is composed of the following components:

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \sum ^{N_{VP}}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_l - {\varvec{x}}_{t_j}||^2} \, , \\&q_{t_i,r} = 0.0001 \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| \, , \\&\quad \ \! \varvec{R} = 0.0005 \cdot I \, . \end{aligned} \end{aligned}$$
(39)

At each time step, the immediate cost penalizes the acceleration to promote smoother trajectories. The terminal cost is the sum of the squared minimum distance to each via-point along the trajectory, plus the squared distance to the goal at the final time step. Unlike the via-points, the goal has to be reached at a specific time, namely the end of the trajectory. On top of that, there is the control cost (see note 4), which penalizes the size of the change from the initial parameter vector (\({\varvec{\theta }}^* = {\varvec{\theta }} - {\varvec{\theta }}^0\)) through the control cost matrix \(\varvec{R}\). Its weight is low, so it mainly acts as a regularization term that prevents the parameter vector from becoming very large.
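As an illustration of how the components of Eq. (39) combine, the following sketch evaluates the cost of a single roll-out stored as arrays of positions and velocities. The function and argument names are hypothetical, and the control-cost term is shown in a simplified quadratic form rather than with the exact per-time-step weighting used by PI2.

```python
import numpy as np

def rollout_cost(x, x_dot, x_goal, via_points, theta_star, R, q_weight=1e-4):
    """Cost of one roll-out following the structure of Eq. (39).

    x          : (N_t, d) positions along the trajectory
    x_dot      : (N_t, d) velocities along the trajectory
    x_goal     : (d,) goal position, to be reached at the final time step
    via_points : (N_vp, d) via-points, each to be approached at some time
    theta_star : flattened parameter change theta - theta_0
    R          : control cost matrix
    """
    # Terminal cost: squared distance to the goal at the last time step, plus
    # the squared minimum distance to each via-point over the whole path.
    phi = np.sum((x_goal - x[-1]) ** 2)
    for x_via in via_points:
        phi += np.min(np.sum((x_via - x) ** 2, axis=1))

    # Immediate cost: penalize the change in velocity (a finite-difference
    # acceleration) at every time step to promote smooth trajectories.
    q = q_weight * np.sum(np.linalg.norm(np.diff(x_dot, axis=0), axis=1))

    # Control cost: quadratic penalty on the deviation from the initial
    # parameters, acting as a regularizer (PI2's per-step weighting omitted).
    control = 0.5 * theta_star @ R @ theta_star

    return phi + q + control
```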

1.2 Details of the second simulation experiment

The second simulation experiment is performed with \(N_r = 20\) and 200 learning iterations. The variance of the exploration noise \({\varvec{{\varSigma }}}_{\epsilon }\) is set to obtain a mean norm of 0.1 for the velocity noise vector (see Sect. 4.1).

The cost function is similar to that of the first experiment. The only difference is in the via-point cost term: two sets of via-points are distinguished, one in the right half-plane and one in the left half-plane of the state space, and the via-point cost that is applied is the minimum of the costs computed for the two sets. This means that the trajectory only has to pass through one of the two sets of via-points. The cost function is thus composed of the following components:

$$\begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \min \limits _{b=1:2} \sum ^{N^{VP}_b}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_{l,b} - {\varvec{x}}_{t_j}||^2} \, , \nonumber \\&q_{t_i,r} = 0.0002 \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| \, ,\\&\quad \ \! \varvec{R} = 0.0005 \cdot I \, ,\nonumber \end{aligned}$$
(40)

where the index b denotes the via-point set.
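The only structural novelty in Eq. (40) is the minimum over the two via-point sets; a small sketch of that term is given below, with illustrative names and the same squared-distance convention as above.

```python
import numpy as np

def via_point_cost_two_sets(x, via_sets):
    """Via-point term of Eq. (40): the trajectory only needs to pass through
    one of the two via-point sets, so the minimum of the per-set costs is used.

    x        : (N_t, d) positions along the trajectory
    via_sets : list of two arrays, each of shape (N_vp_b, d)
    """
    per_set = []
    for via_points in via_sets:  # b = 1, 2
        cost_b = sum(np.min(np.sum((v - x) ** 2, axis=1)) for v in via_points)
        per_set.append(cost_b)
    return min(per_set)
```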

Appendix 2: Additional details for Sect. 4

The cost function for the constrained reaching task is given by the following terminal cost \({\varPhi }_{r}\), immediate cost \(q_{t_i,r}\) and control cost matrix \(\varvec{R}\):

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = 10 \, ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 \, , \\&q_{t_i,r} = 0.01 \,||\dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| + ||\dot{{\varvec{x}}}_{t_i,r}||\, \Delta t - \frac{8.815}{N_t} \, , \\&\quad \ \! \varvec{R} = 0.001 \, I \, . \end{aligned} \end{aligned}$$
(41)

Penalizing the norm of the velocity at each time step is equivalent to penalizing the length of the trajectory. The subtractive term removes the minimal possible path length, 8.815, from the total cost, so that only the excess length is penalized. The norm of the acceleration is penalized to promote smooth trajectories.
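A short sketch of the immediate cost in Eq. (41), assuming the velocities and time step of the roll-out are available; the names and function signature are illustrative only.

```python
import numpy as np

def immediate_cost_reaching(x_dot_i, x_dot_prev, dt, n_t, min_path_length=8.815):
    """Immediate cost q_{t_i} of Eq. (41) for the constrained reaching task.

    The ||x_dot|| * dt term sums to the trajectory length over the roll-out;
    subtracting min_path_length / n_t at every step removes the shortest
    possible path length from the total, so only the excess length is penalized.
    """
    acceleration_penalty = 0.01 * np.linalg.norm(x_dot_i - x_dot_prev)
    path_length_increment = np.linalg.norm(x_dot_i) * dt
    return acceleration_penalty + path_length_increment - min_path_length / n_t
```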

Appendix 3: Additional details for Sect. 5

1.1 Augmented control matrix and parameter vector

Here is an example of the form taken by the matrix \(\varvec{G}\) and the parameter vector \({\varvec{\theta }}\) when they are extended with the stiffness profile. This is for a 1-dimensional stiffness profile (i.e. the same stiffness applied to all dimensions) and a 2-D system:

$$\begin{aligned}&\varvec{G}_{x_t} = \begin{bmatrix} \varvec{G}_{x_t}^1&\varvec{G}_{x_t}^2&\dots&\varvec{G}_{x_t}^{N_G} \end{bmatrix} \end{aligned}$$
(42)
$$\begin{aligned}&\varvec{G}_{x_t}^k = h_{x_t}^k \begin{bmatrix} x_{1,t}&\quad x_{2,t}&\quad 0&\quad 0&\quad 0&\quad 0&\quad 1&\quad 0&\quad 0\\ 0&\quad 0&\quad x_{1,t}&\quad x_{2,t}&\quad 0&\quad 0&\quad 0&\quad 1&\quad 0\\ 0&\quad 0&\quad 0&\quad 0&\quad x_{1,t}&\quad x_{2,t}&\quad 0&\quad 0&\quad 1 \end{bmatrix} \end{aligned}$$
(43)
$$\begin{aligned}&{\varvec{\theta }} = \begin{bmatrix} {\varvec{\theta }}^1&{\varvec{\theta }}^2&\dots&{\varvec{\theta }}^{N_G} \end{bmatrix}^T \end{aligned}$$
(44)
$$\begin{aligned}&{\varvec{\theta }}^k = \begin{bmatrix} A_{1,1}^k&A_{1,2}^k&A_{2,1}^k&A_{2,2}^k&A_{3,1}^k&A_{3,2}^k&b_1^k&b_2^k&b_3^k\\ \end{bmatrix} \end{aligned}$$
(45)

The parameters associated with the stiffness variable \(\varvec{s}\) are \(A_{3,1}^k\), \(A_{3,2}^k\), and \(b_3^k\).
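The block structure of Eqs. (42)–(45) can be made concrete with the following sketch for the 2-D case with a 1-D stiffness profile. Here h is assumed to hold the \(N_G\) Gaussian responsibilities \(h_{x_t}^k\) at the current state; everything else (function names, the NumPy layout) is illustrative.

```python
import numpy as np

def basis_block(h_k, x_t):
    """Block G_{x_t}^k of Eq. (43) for a 2-D state and a 1-D stiffness profile.

    h_k : scalar responsibility of Gaussian k at the state x_t
    x_t : (2,) current state
    Rows 1-2 drive the two components of the reference velocity, row 3 drives
    the stiffness variable s; the trailing identity columns multiply b^k.
    """
    x1, x2 = x_t
    return h_k * np.array([
        [x1, x2, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, x1, x2, 0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, x1, x2, 0.0, 0.0, 1.0],
    ])

def control_matrix(h, x_t):
    """Full G_{x_t} of Eq. (42): the N_G blocks stacked side by side."""
    return np.hstack([basis_block(h_k, x_t) for h_k in h])

# With theta = [theta^1 ... theta^{N_G}] flattened as in Eqs. (44)-(45),
# the model output [x_dot_ref; s] is simply control_matrix(h, x_t) @ theta.
```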

1.2 About learning control parameters with PI2

Learning the stiffness profile for an impedance controller may seem to depart from the most common application of PI2, which is to learn policies in the form of reference trajectories. However, it is explicitly mentioned in Theodorou et al. (2010) that the concept of “action” in path integral optimal control has a broad sense and can be a control gain just as well as a desired state.

In order to have the stiffness parameters \(\varvec{s}\) fit the form imposed by the PI2 formalism, these parameters must be seen as described by auxiliary ODEs (one per dimension of the stiffness) that are added to the set of ODEs representing our point-mass system and our reference trajectory, as explained in Buchli et al. (2011). The auxiliary ODEs have the following form:

$$\begin{aligned} \dot{s^{}_{j}} = \alpha _k \left( {\varvec{g}^{s^{}_{j}}_{ x^{ref}}}^T\left( {\varvec{\theta }}^{s^{}_{j}} + \varvec{\epsilon }^{s^{}_{j}}\right) - s^{}_{j}\right) \end{aligned}$$
(46)

where the index j represents the dimension and in our 2-D case

$$\begin{aligned}&{\varvec{\theta }}^{s^{}_{j}} = \begin{bmatrix} A_{3,1}^j&A_{3,2}^j&b_3^j \end{bmatrix}^T \end{aligned}$$
(47)
$$\begin{aligned}&{\varvec{g}^{s^{}_{j}}_{ x^{ref}}} = h_{x^{ref}}^j \begin{bmatrix} x_{1}^{ref}&\quad x_{2}^{ref}&\quad 1 \end{bmatrix}^T \end{aligned}$$
(48)

The parameter \(\alpha _k\) of this auxiliary ODE is chosen to be very large so that \(s^{}_{j}\) converges very fast to its target value \( {\varvec{g}^{s_{j}}_{ x^{ref}}}^T({\varvec{\theta }}^{s^{}_{j}} + \varvec{\epsilon }^{s^{}_{j}}) \), i.e. much faster than the changes in \({\varvec{g}^{s_{j}}_{ x^{ref}}}\) (caused by the evolution of \( {\varvec{x}}^{ref}\)). We therefore assume that, for any practical purpose, \(s^{}_{j} = {\varvec{g}^{s_{j}}_{ x^{ref}}}^T({\varvec{\theta }}^{s^{}_{j}} + \varvec{\epsilon }^{s^{}_{j}})\), so that by learning \({\varvec{\theta }}^{\varvec{s}} = [ {\varvec{\theta }}^{s^{}_{1}} \quad {\varvec{\theta }}^{s^{}_{2}} \dots ]\) we learn the stiffness profile (as a function of the state).
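For concreteness, a single explicit Euler step of the auxiliary ODE (46) could look as follows; the values of \(\alpha _k\) and the integration step are placeholders chosen only to illustrate the fast-convergence assumption.

```python
import numpy as np

def step_stiffness(s_j, theta_sj, eps_sj, g_sj, alpha_k=500.0, dt=0.001):
    """One Euler step of the auxiliary ODE (46) for stiffness dimension j.

    g_sj     : basis vector g^{s_j}_{x_ref} of Eq. (48), evaluated at x_ref
    theta_sj : parameters [A_31, A_32, b_3] of Eq. (47)
    eps_sj   : exploration noise added to those parameters
    With alpha_k large relative to the dynamics of x_ref, s_j tracks the
    target g_sj^T (theta_sj + eps_sj) almost instantaneously, which is the
    approximation made in the text above.
    """
    target = g_sj @ (theta_sj + eps_sj)
    return s_j + dt * alpha_k * (target - s_j)
```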

1.3 Details of the simulations

The cost function for the first task is given by the following terminal cost \({\varPhi }_{r}\), immediate cost \(q_{t_i,r}\) and control cost matrix \(\varvec{R}\):

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \sum ^{N_{VP}}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_l - {\varvec{x}}_{t_j}||^2} \\&q_{t_i,r}= 10^{-4} \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| + 5 \cdot 10^{-4} ||\varvec{s}_{t_i}||\\&\quad \ \! \varvec{R} = 10^{-6} \cdot I \end{aligned} \end{aligned}$$
(49)

The divergent force field increases linearly with the distance to the nominal path. It is modeled by the following equation:

$$\begin{aligned} \varvec{f}_{t_i} = 10\cdot \min _j {||{\varvec{x}}_{t_i} - {\varvec{x}}_{j}^{nom}||} \end{aligned}$$
(50)

where \([{\varvec{x}}_{1}^{nom} {\varvec{x}}_{2}^{nom} \dots {\varvec{x}}_{N^{nom}}^{nom}]\) is the discrete representation of the nominal path where the force is null (i.e. the ridge of the divergent force field).
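A minimal sketch of Eq. (50), assuming the nominal path is stored as an array of waypoints; only the magnitude of the force is computed here, and the gain of 10 follows the equation above.

```python
import numpy as np

def divergent_force_magnitude(x_t, x_nom, gain=10.0):
    """Magnitude of the divergent force field of Eq. (50).

    x_t   : (d,) current position
    x_nom : (N_nom, d) discretized nominal path (the ridge of the field,
            where the force is zero)
    The magnitude grows linearly with the distance to the closest point of
    the nominal path; the direction pushing away from the path is not
    modeled in this sketch.
    """
    distances = np.linalg.norm(x_t - x_nom, axis=1)
    return gain * np.min(distances)
```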

The cost function for the second task has the same structure as that of the first, with slightly different weights for the stiffness and acceleration costs:

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \sum ^{N_{VP}}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_l - {\varvec{x}}_{t_j}||^2} \\&q_{t_i,r} = 10^{-3} \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| + 2 \cdot 10^{-4} ||\varvec{s}_{t_i}||\\&\quad \ \! \varvec{R} = 10^{-6} \cdot I. \end{aligned} \end{aligned}$$
(51)


Cite this article

Rey, J., Kronander, K., Farshidian, F. et al. Learning motions from demonstrations and rewards with time-invariant dynamical systems based policies. Auton Robot 42, 45–64 (2018). https://doi.org/10.1007/s10514-017-9636-y
