Learning motions from demonstrations and rewards with time-invariant dynamical systems based policies

Autonomous Robots

Abstract

An important challenge when using reinforcement learning to learn motions in robotics is the choice of parameterization for the policy. Following the learning from demonstration paradigm, we use Gaussian Mixture Regression to extract a parameterization with relevant non-linear features from a set of demonstrations of a motion. The resulting parameterization takes the form of a non-linear time-invariant dynamical system (DS). We use this time-invariant DS as a parameterized policy for a variant of the PI2 policy search algorithm. This paper contributes by adapting PI2 to this time-invariant motion representation. We introduce two novel parameter exploration schemes that can be used to (1) sample model parameters so as to achieve uniform exploration in state space and (2) explore while ensuring stability of the resulting motion model. Additionally, a state-dependent stiffness profile is learned simultaneously with the reference trajectory, and both are used together in a variable impedance control architecture. This learning architecture is validated in a hardware experiment consisting of a digging task on a KUKA LWR platform.

Notes

  1. For the 3-dimensional case, the norm follows a Maxwell–Boltzmann distribution with mean \(\mu _v = 2\tilde{\sigma }\sqrt{\frac{2}{\pi }}\).

  2. Note that the divergent force field is a perturbation that has nothing to do with the exploration noise. The divergent force field is not stochastic, and it is part of the system.

  3. This method can improve the robustness by preventing problematic situations where none of the roll-outs yield good performance.

  4. Although for our parameterization the size of the parameters is only loosely related to the actual control that will be applied to the system, we keep the conventional name “control cost” for this term.

References

  • Ajoudani, A., Tsagarakis, N., & Bicchi, A. (2012). Tele-impedance: Teleoperation with impedance regulation using a body-machine interface. The International Journal of Robotics Research, 31(13), 1642–1656.

  • Billard, A., Calinon, S., Dillmann, R., & Schaal, S. (2008). Robot programming by demonstration. In Springer Handbook of Robotics (Chap. 59). Springer.

  • Buchli, J., Stulp, F., Theodorou, E., & Schaal, S. (2011). Learning variable impedance control. The International Journal of Robotics Research, 30(7), 820–833.

  • Burdet, E., Osu, R., Franklin, D. W., Milner, T. E., & Kawato, M. (2001). The central nervous system stabilizes unstable dynamics by learning optimal impedance. Nature, 414(6862), 446–449.

  • Calinon, S., Bruno, S., & Caldwell, D.G. (2014). A task-parameterized probabilistic model with minimal intervention control. In IEEE International Conference on Robotics and Automation (pp. 3339–3344).

  • Calinon, S., Sardellitti, I., Caldwell, D. (2010). Learning-based control strategy for safe human-robot interaction exploiting task and robot redundancies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 249–254).

  • Calinon, S., D’halluin, F., Sauser, E. L., Caldwell, D. G., & Billard, A. G. (2010). Learning and reproduction of gestures by imitation. IEEE Robotics & Automation Magazine, 17(2), 44–54.

  • Calinon, S., Kormushev, P., & Caldwell, D. G. (2013). Compliant skills acquisition and multi-optima policy search with EM-based reinforcement learning. Robotics and Autonomous Systems, 61(4), 369–379.

  • Daniel, C., Neumann, G., & Peters, J. (2012). Learning concurrent motor skills in versatile solution spaces. In Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on 2012. IEEE, (pp. 3591–3597).

  • Farshidian, F., Neunert, M., & Buchli, J. (2014). Learning of closed-loop motion control. In IEEE International Conference on Intelligent Robots and Systems, no. IROS (pp. 1441–1446).

  • Garabini, M., Passaglia, A., Belo, F., Salaris, P., & Bicchi, A. (2012). Optimality principles in stiffness control: The VSA kick. In IEEE International Conference on Robotics and Automation (pp. 3341–3346).

  • Gribovskaya, E., Khansari-Zadeh, S. M., & Billard, A. (2010). Learning non-linear multivariate dynamics of motion in robotic manipulators. The International Journal of Robotics Research, 30(1), 80–117.

  • Guenter, F., Hersch, M., Calinon, S., & Billard, A. (2007). Reinforcement learning for imitating constrained reaching movements. Advanced Robotics, 21(13), 1521–1544.

  • Gullapalli, V., Franklin, J. A., & Benbrahim, H. (1994). Acquiring robot skills via reinforcement learning. IEEE Control Systems, 14(1), 13–24.

  • Hogan, N. (1985). Impedance control: An approach to manipulation. Journal of Dynamic Systems, Measurement, and Control, 107(1), 1–24.

  • Howard, M., Braun, D. J., & Vijayakumar, S. (2013). Transferring human impedance behavior to heterogeneous variable impedance actuators. IEEE Transactions on Robotics, 29(4), 847–862.

  • Ijspeert, A.J., Nakanishi, J., & Schaal, S. (2002). Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation, IEEE (Vol. 2, pp. 1398–1403).

  • Khansari-Zadeh, S. M., & Billard, A. (2011). Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 27(5), 943–957.

  • Kober, J., & Peters, J. (2009). Learning motor primitives for robotics. In IEEE International Conference on Robotics and Automation, 2009, ICRA’09, IEEE (pp. 2112–2118).

  • Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32, 1238–1274.

  • Kober, J., & Peters, J. (2010). Policy search for motor primitives in robotics. Machine Learning, 84(1–2), 171–203.

  • Kober, J., & Peters, J. (2010). Imitation and reinforcement learning. IEEE Robotics Automation Magazine, 17(2), 55–62.

  • Kronander, K., Khansari-Zadeh, S. M., & Billard, A. (2015). Incremental motion learning with locally modulated dynamical systems. Robotics and Autonomous Systems, 70, 52–62.

  • Kronander, K., & Billard, A. (2013). Learning compliant manipulation through kinesthetic and tactile human-robot interaction. Transactions on Haptics, 7(3), 1–16.

  • Kronander, K., & Billard, A. (2016). Passive interaction control with dynamical systems. Robotics and Automation Letters, 1(1), 106–113.

  • Lee, A. X., Lu, H., Gupta, A., Levine, S., & Abbeel, P. (2015). Learning force-based manipulation of deformable objects from multiple demonstrations. In IEEE International Conference on Robotics and Automation.

  • Lemme, A., Neumann, K., Reinhart, R., & Steil, J. (2014). Neural learning of vector fields for encoding stable dynamical systems. Neurocomputing, 141, 3–14.

  • Medina, J., Sieber, D., & Hirche, S. (2013). Risk-sensitive interaction control in uncertain manipulation tasks. In IEEE International Conference on Robotics and Automation.

  • Mitrovic, D., Klanke, S., & Vijayakumar, S. (2011). Learning impedance control of antagonistic systems based on stochastic optimization principles. The International Journal of Robotics Research, 30(5), 556–573.

  • Paraschos, A., Daniel, C., Peters, J., & Neumann, G. (2013). Probabilistic movement primitives. Neural Information Processing Systems (pp. 1–9).

  • Pastor, P., Righetti, L., Kalakrishnan, M., & Schaal, S. (2011). Online movement adaptation based on previous sensor experiences. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011. IEEE (pp. 365–371).

  • Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7–9), 1180–1190.

  • Rozo, L., Calinon, S., Caldwell, D., Jiménez, P., & Torras, C. (2013). Learning collaborative impedance-based robot behaviors. In AAAI Conference on Artificial Intelligence.

  • Rückert, E. A., Neumann, G., Toussaint, M., & Maass, W. (2013). Learned graphical models for probabilistic planning provide a new class of movement primitives. Frontiers in Computational Neuroscience, 6(January), 1–20.

  • Schaal, S., Ijspeert, A., & Billard, A. (2003). Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 358(1431), 537–547.

  • Selen, L. P. J., Franklin, D. W., & Wolpert, D. M. (2009). Impedance control reduces instability that arises from motor noise. The Journal of Neuroscience, 29(40), 12606–12616.

  • Stulp, F., & Sigaud, O. (2012). Policy improvement methods: Between black-box optimization and episodic reinforcement learning.

  • Stulp, F., Sigaud, O. (2012). Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 281–288).

  • Sung, H.G. (2004). Gaussian mixture regression and classification (Ph.D. dissertation, Rice University).

  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

  • Tedrake, R., Zhang, T. W., & Seung, H.S. (2004). Stochastic policy gradient reinforcement learning on a simple 3d biped. In Intelligent Robots and Systems, 2004. (IROS 2004). Proceedings of IEEE/RSJ International Conference on 2004. IEEE (Vol. 3, pp. 2849–2854).

  • Tee, K. P., Franklin, D. W., Kawato, M., Milner, T. E., Burdet, E., Peng, K., et al. (2010). Concurrent adaptation of force and impedance in the redundant muscle system. Biological Cybernetics, 102(1), 31–44.

  • Theodorou, E., Buchli, J., & Schaal, S. (2010). A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11, 3137–3181.

  • Thijssen, S., & Kappen, H. (2015). Path integral control and state-dependent feedback. Physical Review E, 91(3), 032104.

  • Toussaint, M. (2009). Probabilistic inference as a model of planned behavior. Künstliche Intelligenz, 3(9), 23–29.

  • Vlassis, N., Toussaint, M., Kontes, G., & Piperidis, S. (2009). Learning model-free robot control by a monte carlo EM algorithm. Autonomous Robots, 27(2), 123–130.

  • Yang, C., Ganesh, G., Haddadin, S., Parusel, S., Albu-Schaffer, A., & Burdet, E. (2011). Human-like adaptation of force and impedance in stable and unstable interactions. IEEE Transactions on Robotics, 27(5), 918–930.

Acknowledgements

This research was funded by the European Union Seventh Framework Programme FP7/2007-2013 under Grant Agreement No. 288533 ROBOHOW.COG and by the Swiss National Science Foundation through the National Center of Competence in Research Robotics.

Author information

Corresponding author

Correspondence to Klas Kronander.

Appendices

Appendix 1: Additional details for Sect. 3

1.1 Details of the first simulation experiment

The first simulation experiment is performed with \(N_r = 10\) and 50 learning iterations. The variance of the exploration noise \({\varvec{{\varSigma }}}_{\epsilon }\) is set to obtain a mean norm of 0.1 for the velocity noise vector (see Sect. 4.1).
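The relation in note 1 can be used to set \({\varvec{{\varSigma }}}_{\epsilon }\) from a target mean noise norm. Below is a minimal numerical sketch of that inversion, assuming isotropic noise with per-axis standard deviation \(\tilde{\sigma }\); the function names and the Monte-Carlo check are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def sigma_for_mean_norm(target_mean_norm, dim=3):
    """Per-axis standard deviation of isotropic Gaussian noise such that the
    expected norm of the noise vector equals target_mean_norm.

    For dim == 3 the norm is Maxwell-Boltzmann distributed with mean
    2 * sigma * sqrt(2 / pi), as stated in note 1; for dim == 2 the norm is
    Rayleigh distributed with mean sigma * sqrt(pi / 2).
    """
    if dim == 3:
        return target_mean_norm / (2.0 * np.sqrt(2.0 / np.pi))
    if dim == 2:
        return target_mean_norm / np.sqrt(np.pi / 2.0)
    raise ValueError("only 2-D and 3-D cases are handled in this sketch")

# Monte-Carlo check: sample velocity noise and verify the mean norm.
rng = np.random.default_rng(0)
sigma = sigma_for_mean_norm(0.1, dim=3)           # target mean norm of 0.1
noise = rng.normal(scale=sigma, size=(100000, 3))
print(np.linalg.norm(noise, axis=1).mean())       # close to 0.1
```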

The cost is composed of the following components:

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \sum ^{N_{VP}}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_l - {\varvec{x}}_{t_j}||^2} \, , \\&q_{t_i,r} = 0.0001 \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| \, , \\&\quad \ \! \varvec{R} = 0.0005 \cdot I \, . \end{aligned} \end{aligned}$$
(39)

At each time step, the immediate cost penalizes the acceleration to promote smoother trajectories. The terminal cost is the sum of the squared minimum distance to each via-point along the trajectory, plus the squared distance to the goal at the final time step. Unlike the via-points, the goal has to be reached at a specific time, namely the end of the trajectory. On top of that, there is the control cost (see note 4), which penalizes the size of the change from the initial parameter vector (\({\varvec{\theta }}^* = {\varvec{\theta }} - {\varvec{\theta }}^0\)) through the control cost matrix \(\varvec{R}\). Its weight is low, so it mainly acts as a regularization term that prevents the parameter vector from becoming very large.
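As an illustration of how the components of Eq. (39) combine, the following sketch evaluates the cost of a single roll-out stored as arrays of positions and velocities. The function and argument names are hypothetical, and the control-cost term is shown in a simplified quadratic form rather than with the exact per-time-step weighting used by PI2.

```python
import numpy as np

def rollout_cost(x, x_dot, x_goal, via_points, theta_star, R, q_weight=1e-4):
    """Cost of one roll-out following the structure of Eq. (39).

    x          : (N_t, d) positions along the trajectory
    x_dot      : (N_t, d) velocities along the trajectory
    x_goal     : (d,) goal position, to be reached at the final time step
    via_points : (N_vp, d) via-points, each to be approached at some time
    theta_star : flattened parameter change theta - theta_0
    R          : control cost matrix
    """
    # Terminal cost: squared distance to the goal at the last time step, plus
    # the squared minimum distance to each via-point over the whole path.
    phi = np.sum((x_goal - x[-1]) ** 2)
    for x_via in via_points:
        phi += np.min(np.sum((x_via - x) ** 2, axis=1))

    # Immediate cost: penalize the change in velocity (a finite-difference
    # acceleration) at every time step to promote smooth trajectories.
    q = q_weight * np.sum(np.linalg.norm(np.diff(x_dot, axis=0), axis=1))

    # Control cost: quadratic penalty on the deviation from the initial
    # parameters, acting as a regularizer (PI2's per-step weighting omitted).
    control = 0.5 * theta_star @ R @ theta_star

    return phi + q + control
```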

1.2 Details of the second simulation experiment

The second simulation experiment is performed with \(N_r = 20\) and 200 learning iterations. The variance of the exploration noise \({\varvec{{\varSigma }}}_{\epsilon }\) is set to obtain a mean norm of 0.1 for the velocity noise vector (see Sect. 4.1).

The cost function is similar to that of the first experiment. The only difference is in the via-point cost term: two sets of via-points are distinguished, one in the right half-plane and one in the left half-plane of the state space, and the via-point cost that is applied is the minimum of the costs computed for the two sets. This means that the trajectory only has to pass through one of the two sets of via-points. The cost function is thus composed of the following components:

$$\begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \min \limits _{b=1:2} \sum ^{N^{VP}_b}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_{l,b} - {\varvec{x}}_{t_j}||^2} \, , \nonumber \\&q_{t_i,r} = 0.0002 \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| \, ,\\&\quad \ \! \varvec{R} = 0.0005 \cdot I \, ,\nonumber \end{aligned}$$
(40)

where the index b denotes the via-point set.
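The only structural novelty in Eq. (40) is the minimum over the two via-point sets; a small sketch of that term is given below, with illustrative names and the same squared-distance convention as above.

```python
import numpy as np

def via_point_cost_two_sets(x, via_sets):
    """Via-point term of Eq. (40): the trajectory only needs to pass through
    one of the two via-point sets, so the minimum of the per-set costs is used.

    x        : (N_t, d) positions along the trajectory
    via_sets : list of two arrays, each of shape (N_vp_b, d)
    """
    per_set = []
    for via_points in via_sets:  # b = 1, 2
        cost_b = sum(np.min(np.sum((v - x) ** 2, axis=1)) for v in via_points)
        per_set.append(cost_b)
    return min(per_set)
```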

Appendix 2: Additional details for Sect. 4

The cost function for the constrained reaching task is given by the following terminal cost \({\varPhi }_{r}\), immediate cost \(q_{t_i,r}\) and control cost matrix \(\varvec{R}\):

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = 10 \, ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 \, , \\&q_{t_i,r} = 0.01 \,||\dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| + ||\dot{{\varvec{x}}}_{t_i,r}||\, \Delta t - \frac{8.815}{N_t} \, , \\&\quad \ \! \varvec{R} = 0.001 \, I \, . \end{aligned} \end{aligned}$$
(41)

Penalizing the norm of the velocity at each time step is equivalent to penalizing the length of the trajectory. The subtractive term removes the minimal possible path length, 8.815, from the total cost, so that only the excess length is penalized. The norm of the acceleration is penalized to promote smooth trajectories.
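A short sketch of the immediate cost in Eq. (41), assuming the velocities and time step of the roll-out are available; the names and function signature are illustrative only.

```python
import numpy as np

def immediate_cost_reaching(x_dot_i, x_dot_prev, dt, n_t, min_path_length=8.815):
    """Immediate cost q_{t_i} of Eq. (41) for the constrained reaching task.

    The ||x_dot|| * dt term sums to the trajectory length over the roll-out;
    subtracting min_path_length / n_t at every step removes the shortest
    possible path length from the total, so only the excess length is penalized.
    """
    acceleration_penalty = 0.01 * np.linalg.norm(x_dot_i - x_dot_prev)
    path_length_increment = np.linalg.norm(x_dot_i) * dt
    return acceleration_penalty + path_length_increment - min_path_length / n_t
```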

Appendix 3: Additional details for Sect. 5

1.1 Augmented control matrix and parameter vector

Here is an example of the form taken by the matrix \(\varvec{G}\) and the parameter vector \({\varvec{\theta }}\) when they are extended with the stiffness profile. This is for a 1-dimensional stiffness profile (i.e. the same stiffness applied to all dimensions) and a 2-D system:

$$\begin{aligned}&\varvec{G}_{x_t} = \begin{bmatrix} \varvec{G}_{x_t}^1&\varvec{G}_{x_t}^2&\dots&\varvec{G}_{x_t}^{N_G} \end{bmatrix} \end{aligned}$$
(42)
$$\begin{aligned}&\varvec{G}_{x_t}^k = h_{x_t}^k \begin{bmatrix} x_{1,t}&\quad x_{2,t}&\quad 0&\quad 0&\quad 0&\quad 0&\quad 1&\quad 0&\quad 0\\ 0&\quad 0&\quad x_{1,t}&\quad x_{2,t}&\quad 0&\quad 0&\quad 0&\quad 1&\quad 0\\ 0&\quad 0&\quad 0&\quad 0&\quad x_{1,t}&\quad x_{2,t}&\quad 0&\quad 0&\quad 1 \end{bmatrix} \end{aligned}$$
(43)
$$\begin{aligned}&{\varvec{\theta }} = \begin{bmatrix} {\varvec{\theta }}^1&{\varvec{\theta }}^2&\dots&{\varvec{\theta }}^{N_G} \end{bmatrix}^T \end{aligned}$$
(44)
$$\begin{aligned}&{\varvec{\theta }}^k = \begin{bmatrix} A_{1,1}^k&A_{1,2}^k&A_{2,1}^k&A_{2,2}^k&A_{3,1}^k&A_{3,2}^k&b_1^k&b_2^k&b_3^k\\ \end{bmatrix} \end{aligned}$$
(45)

The parameters associated with the stiffness variable \(\varvec{s}\) are \(A_{3,1}^k\), \(A_{3,2}^k\), and \(b_3^k\).
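The block structure of Eqs. (42)–(45) can be made concrete with the following sketch for the 2-D case with a 1-D stiffness profile. Here h is assumed to hold the \(N_G\) Gaussian responsibilities \(h_{x_t}^k\) at the current state; everything else (function names, the NumPy layout) is illustrative.

```python
import numpy as np

def basis_block(h_k, x_t):
    """Block G_{x_t}^k of Eq. (43) for a 2-D state and a 1-D stiffness profile.

    h_k : scalar responsibility of Gaussian k at the state x_t
    x_t : (2,) current state
    Rows 1-2 drive the two components of the reference velocity, row 3 drives
    the stiffness variable s; the trailing identity columns multiply b^k.
    """
    x1, x2 = x_t
    return h_k * np.array([
        [x1, x2, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, x1, x2, 0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, x1, x2, 0.0, 0.0, 1.0],
    ])

def control_matrix(h, x_t):
    """Full G_{x_t} of Eq. (42): the N_G blocks stacked side by side."""
    return np.hstack([basis_block(h_k, x_t) for h_k in h])

# With theta = [theta^1 ... theta^{N_G}] flattened as in Eqs. (44)-(45),
# the model output [x_dot_ref; s] is simply control_matrix(h, x_t) @ theta.
```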

1.2 About learning control parameters with PI2

Learning the stiffness profile for an impedance controller may seem to depart from the most common application of PI2, which is to learn policies in the form of reference trajectories. However, it is explicitly mentioned in Theodorou et al. (2010) that the concept of “action” in path integral optimal control has a broad sense and can be a control gain just as well as a desired state.

In order to have the stiffness parameters \(\varvec{s}\) fit the form imposed by the PI2 formalism, these parameters must be seen as described by auxiliary ODEs (one per dimension of the stiffness) that are added to the set of ODEs representing our point-mass system and our reference trajectory, as explained in Buchli et al. (2011). The auxiliary ODEs have the following form:

$$\begin{aligned} \dot{s^{}_{j}} = \alpha _k \left( {\varvec{g}^{s^{}_{j}}_{ x^{ref}}}^T\left( {\varvec{\theta }}^{s^{}_{j}} + \varvec{\epsilon }^{s^{}_{j}}\right) - s^{}_{j}\right) \end{aligned}$$
(46)

where the index j represents the dimension and in our 2-D case

$$\begin{aligned}&{\varvec{\theta }}^{s^{}_{j}} = \begin{bmatrix} A_{3,1}^j&A_{3,2}^j&b_3^j \end{bmatrix}^T \end{aligned}$$
(47)
$$\begin{aligned}&{\varvec{g}^{s^{}_{j}}_{ x^{ref}}} = h_{x^{ref}}^j \begin{bmatrix} x_{1}^{ref}&\quad x_{2}^{ref}&\quad 1 \end{bmatrix}^T \end{aligned}$$
(48)

The parameter \(\alpha _k\) of this auxiliary ODE is chosen to be very large so that \(s^{}_{j}\) converges very fast to its target value \( {\varvec{g}^{s_{j}}_{ x^{ref}}}^T({\varvec{\theta }}^{s^{}_{j}} + \varvec{\epsilon }^{s^{}_{j}}) \), i.e. much faster than the changes in \({\varvec{g}^{s_{j}}_{ x^{ref}}}\) (caused by the evolution of \( {\varvec{x}}^{ref}\)). We therefore assume that, for any practical purpose, \(s^{}_{j} = {\varvec{g}^{s_{j}}_{ x^{ref}}}^T({\varvec{\theta }}^{s^{}_{j}} + \varvec{\epsilon }^{s^{}_{j}})\), so that by learning \({\varvec{\theta }}^{\varvec{s}} = [ {\varvec{\theta }}^{s^{}_{1}} \quad {\varvec{\theta }}^{s^{}_{2}} \dots ]\) we learn the stiffness profile (as a function of the state).
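For concreteness, a single explicit Euler step of the auxiliary ODE (46) could look as follows; the values of \(\alpha _k\) and the integration step are placeholders chosen only to illustrate the fast-convergence assumption.

```python
import numpy as np

def step_stiffness(s_j, theta_sj, eps_sj, g_sj, alpha_k=500.0, dt=0.001):
    """One Euler step of the auxiliary ODE (46) for stiffness dimension j.

    g_sj     : basis vector g^{s_j}_{x_ref} of Eq. (48), evaluated at x_ref
    theta_sj : parameters [A_31, A_32, b_3] of Eq. (47)
    eps_sj   : exploration noise added to those parameters
    With alpha_k large relative to the dynamics of x_ref, s_j tracks the
    target g_sj^T (theta_sj + eps_sj) almost instantaneously, which is the
    approximation made in the text above.
    """
    target = g_sj @ (theta_sj + eps_sj)
    return s_j + dt * alpha_k * (target - s_j)
```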

1.3 Details of the simulations

The cost function for the first task is given by the following terminal cost \({\varPhi }_{r}\), immediate cost \(q_{t_i,r}\) and control cost matrix \(\varvec{R}\):

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \sum ^{N_{VP}}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_l - {\varvec{x}}_{t_j}||^2} \\&q_{t_i,r}= 10^{-4} \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| + 5 \cdot 10^{-4} ||\varvec{s}_{t_i}||\\&\quad \ \! \varvec{R} = 10^{-6} \cdot I \end{aligned} \end{aligned}$$
(49)

The divergent force field increases linearly with the distance to the nominal path. It is modeled by the following equation:

$$\begin{aligned} \varvec{f}_{t_i} = 10\cdot \min _j {||{\varvec{x}}_{t_i} - {\varvec{x}}_{j}^{nom}||} \end{aligned}$$
(50)

where \([{\varvec{x}}_{1}^{nom} {\varvec{x}}_{2}^{nom} \dots {\varvec{x}}_{N^{nom}}^{nom}]\) is the discrete representation of the nominal path where the force is null (i.e. the ridge of the divergent force field).
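A minimal sketch of Eq. (50), assuming the nominal path is stored as an array of waypoints; only the magnitude of the force is computed here, and the gain of 10 follows the equation above.

```python
import numpy as np

def divergent_force_magnitude(x_t, x_nom, gain=10.0):
    """Magnitude of the divergent force field of Eq. (50).

    x_t   : (d,) current position
    x_nom : (N_nom, d) discretized nominal path (the ridge of the field,
            where the force is zero)
    The magnitude grows linearly with the distance to the closest point of
    the nominal path; the direction pushing away from the path is not
    modeled in this sketch.
    """
    distances = np.linalg.norm(x_t - x_nom, axis=1)
    return gain * np.min(distances)
```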

The cost function for the second task has the same structure as that of the first, with slightly different weights for the stiffness and acceleration costs:

$$\begin{aligned} \begin{aligned}&{\varPhi }_{r} = ||{\varvec{x}}^{goal} - {\varvec{x}}_{t_{N_t}}||^2 + \sum ^{N_{VP}}_{l = 1} \min \limits _{j=1:N_t}{||{\varvec{x}}^{via}_l - {\varvec{x}}_{t_j}||^2} \\&q_{t_i,r} = 10^{-3} \cdot || \dot{{\varvec{x}}}_{t_i,r} - \dot{{\varvec{x}}}_{t_{i-1},r}|| + 2 \cdot 10^{-4} ||\varvec{s}_{t_i}||\\&\quad \ \! \varvec{R} = 10^{-6} \cdot I. \end{aligned} \end{aligned}$$
(51)


Cite this article

Rey, J., Kronander, K., Farshidian, F. et al. Learning motions from demonstrations and rewards with time-invariant dynamical systems based policies. Auton Robot 42, 45–64 (2018). https://doi.org/10.1007/s10514-017-9636-y
