# Discovering relevant task spaces using inverse feedback control

## Abstract

Learning complex skills by repeating and generalizing expert behavior is a fundamental problem in robotics. However, the usual approaches do not answer the question of which representations are appropriate for generating motion for a specific task. Since it is time-consuming for a human expert to manually design the motion control representation for a task, we propose to uncover such structure from data, namely observed motion trajectories. Inspired by Inverse Optimal Control, we present a novel method to learn a latent value function, imitate and generalize demonstrated behavior, and discover a task-relevant motion representation. We test our method, called Task Space Retrieval Using Inverse Feedback Control (TRIC), on several challenging high-dimensional tasks. TRIC learns the important control dimensions for the tasks from a few example movements and is able to robustly generalize to new situations.


## Notes

1. To simplify further, we use $$t_0 = 1$$ and $$t_T = T$$.

2. Note that the demonstrations may come from different situations, with objects placed at different locations. Since the features $$y_t^i$$ may be object-relative, they cannot be recovered from $$q_t^i$$ alone, a dependence we neglect in our notation. We therefore also record $$y_t^i$$ and the Jacobians $$\dfrac{\partial \phi }{\partial q}(q_t^i)$$ for all demonstrations. When it is clear from context that we are dealing with a single trajectory, we drop the superscript $$i$$ and write $$q_t$$ instead of $$q_t^i$$.

3. Here we write $$\dfrac{\partial f}{\partial y}$$ instead of $$\dfrac{\partial f}{\partial \phi }$$, because $$y = \phi (q)$$.

4. It is still decreasing despite the gradient sign change because of other features coupled geometrically to $$p_{3,1}^y$$.

5. Implicit surface object models are learned from sensory data, and the object surface is itself represented by a nonlinear potential function, e.g. a Gaussian process or SVR; see Steinke et al. (2005).
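To make the idea in note 5 concrete, here is a minimal sketch of a Gaussian-process implicit surface in 2-D. The object, labels, and radii are illustrative choices made up for this example (not taken from Steinke et al. or from the paper): surface points get potential 0, an interior point $$-1$$, and exterior points $$+1$$, so the GP mean acts as an implicit surface potential.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 2-D "object": the unit circle. Surface points are labeled with
# potential 0, one interior point with -1, exterior points with +1.
angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
surface = np.c_[np.cos(angles), np.sin(angles)]                    # potential  0
interior = np.array([[0.0, 0.0]])                                  # potential -1
exterior = 2.0 * np.c_[np.cos(angles[::2]), np.sin(angles[::2])]   # potential +1

X = np.vstack([surface, interior, exterior])
y = np.concatenate([np.zeros(16), [-1.0], np.ones(8)])

# Fixed kernel (optimizer=None) and a small jitter keep the fit predictable.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              optimizer=None, alpha=1e-6).fit(X, y)

# The learned potential is ~0 on the surface, negative inside, positive outside.
inside, held_out, outside = gp.predict(
    np.array([[0.0, 0.0], [np.cos(0.2), np.sin(0.2)], [2.0, 0.0]]))
```

The zero level set of the GP mean then serves as the object surface, and its gradient can be used as a task-space feature, in the spirit of the implicit surface models referenced above.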

## References

• Argall, B. D., Chernova, S., Veloso, M. M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.

• Bain, M., and Sammut, C. (1996). A framework for behavioural cloning. In Machine intelligence, vol 15 (pp. 103–129). Oxford: Oxford University Press.

• Berniker, M., & Kording, K. (2008). Estimating the sources of motor errors for adaptation and generalization. Nature Neuroscience, 11(12), 1454–1461.

• Billard, A., Epars, Y., Calinon, S., Cheng, G., & Schaal, S. (2004). Discovering optimal imitation strategies. Robotics and Autonomous Systems, Special Issue: Robot Learning from Demonstration, 47(2–3), 69–77.

• Calinon, S., and Billard, A. (2007). Incremental learning of gestures by imitation in a humanoid robot. In HRI ’07: Proceedings of the ACM/IEEE International Conference on Human–Robot Interaction (pp. 255–262).

• Call, J., and Carpenter, M. (2002). Three sources of information in social learning. Imitation in animals and artifacts, pp. 211–228.

• Craig, J. J. (1989). Introduction to robotics: Mechanics and control (2nd ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. ISBN 0201095289.

• Dragiev, S., Toussaint, M., and Gienger, M. (2011). Gaussian process implicit surfaces for object estimation and grasping. In IEEE International Conference on Robotics and Automation (ICRA).

• Gienger, M., Toussaint, M., Jetchev, N., Bendig, A., and Goerick, C. (2008). Optimization of fluent approach and grasp motions. In 8th IEEE-RAS International Conference on Humanoid Robots.

• Haindl, M., Somol, P., Ververidis, D., and Kotropoulos, C. (2006). Feature selection based on mutual correlation. In CIARP, pp. 569–577.

• Hiraki, K., Sashima, A., & Phillips, S. (1998). From egocentric to allocentric spatial behavior: A computational model of spatial development. Adaptive Behavior, 6(3–4), 371–391.

• Ho, E. S. L., Komura, T., & Tai, C.-L. (2010). Spatial relationship preserving character motion adaptation. ACM Transactions on Graphics, 29(4), 1–8.

• Howard, M., Klanke, S., Gienger, M., Goerick, C., & Vijayakumar, S. (2009). A novel method for learning policies from variable constraint data. Autonomous Robots, 27, 105–121.

• Jenkins, O. C., and Matarić, M. J. (2004). A spatio-temporal extension to Isomap nonlinear dimension reduction. In 21st International Conference on Machine Learning (ICML).

• Jetchev, N. (2012). Learning representations from motion trajectories: Analysis and applications to robot planning and control. PhD thesis, FU Berlin. Retrieved from http://www.diss.fu-berlin.de/diss/receive/FUDISS_thesis_000000037417. Accessed 1 Aug 2012.

• Jetchev N., and Toussaint, M. (2011). Task space retrieval using inverse feedback control. In 28th International Conference on Machine Learning (ICML), pp. 449–456.

• Khansari-Zadeh, S. M., and Billard, A. (2010). BM: An iterative algorithm to learn stable non-linear dynamical systems with Gaussian mixture models. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2381–2388.

• Kroemer, O., Detry, R., Piater, J. H., and Peters, J. (2009). Active learning using mean shift optimization for robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2610–2615.

• Kroemer, O., Detry, R., Piater, J. H., & Peters, J. (2010). Grasping with vision descriptors and motor primitives. ICINCO, 2, 47–54.

• LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. In Predicting structured data.

• Montavon, G., Braun, M., & Müller, K.-R. (2011). Kernel analysis of deep networks. Journal of Machine Learning Research, 12, 2563–2581.

• Muehlig, M., Gienger, M., Steil, J. J., and Goerick, C. (2009). Automatic selection of task spaces for imitation learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4996–5002.

• Myers, C. S., & Rabiner, L. R. (1981). A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal, 60(7), 1389–1409.

• Nouri, A., & Littman, M. L. (2010). Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning, 81(1), 85–98.

• Perkins, T. J., & Barto, A. G. (2002). Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3, 803–832.

• Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3, 88–97.

• Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.

• Ratliff, N., Ziebart, B., Peterson, K., Bagnell, J. A., Hebert, M., Dey, A. K., and Srinivasa, S. (2009). Inverse optimal heuristic control for imitation learning. In Proceedings of AISTATS, pp. 424–431.

• Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. (2006). Maximum margin planning. In 26th International Conference on Machine Learning (ICML), pp. 729–736.

• Schaal, S., Peters, J., Nakanishi, J., and Ijspeert, A. J. (2003). Learning movement primitives. In International Symposium on Robotics Research, pp. 561–572.

• Siciliano, B., & Khatib, O. (Eds.). (2008). Springer handbook of robotics. Berlin: Springer.

• Slotine, J.-J., & Li, W. (1991). Applied nonlinear control. Upper Saddle River, NJ: Prentice Hall.

• Steinke, F., Schölkopf, B., & Blanz, V. (2005). Support vector machines for 3D shape processing. Computer Graphics Forum, 24(3), 285–294.

• Tegin, J., Ekvall, S., Kragic, D., Wikander, J., & Iliev, B. (2009). Demonstration-based learning and control for automatic grasping. Intelligent Service Robotics, 2, 23–30.

• Toussaint, M. (2009). Robot trajectory optimization using approximate inference. In 26th International Conference on Machine Learning (ICML), pp. 1049–1056.

• Toussaint, M. (2011). Robotics. University Lecture, 2011. Retrieved from http://userpage.fu-berlin.de/mtoussai/teaching/11-Robotics/. Accessed 1 Aug 2012.

• Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.

• Ude, A., Gams, A., Asfour, T., & Morimoto, J. (2010). Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Transactions on Robotics, 26(5), 800–815.

• Wagner, T., Visser, U., & Herzog, O. (2004). Egocentric qualitative spatial knowledge representation for physical robots. Robotics and Autonomous Systems, 49(1–2), 25–42.

## Acknowledgments

This work was supported by the German Research Foundation (DFG), Emmy Noether fellowship TO 409/1-3, and the EU FP7 project TOMSY.

## Author information

### Corresponding author

Correspondence to Nikolay Jetchev.

## Appendix

### 1.1 Proof of Proposition 1 for the direction of IK-generated motion steps

Proposition 1 If $$\varrho \rightarrow \infty$$ then the IK solution $$q_{t+1}$$ minimizing Eq. (9) has the property that the next step $$q_{t+1} - q_{t}$$ is approximately proportional to the value function gradient $$\mathcal {J}$$ in a small region around $$q_t$$.

### Proof

If $$\varrho \rightarrow \infty$$ then the term $$||f \circ \phi (q) - f \circ \phi (q_t) + \delta ||^2$$ of Eq. (9) is weighted so heavily that $$C_{prior}$$ and any other cost terms we might add are negligible. Let $$\mathcal {J}$$ be the gradient of the value function $$f \circ \phi (q)$$ evaluated at $$q = q_t$$. Using the linearization $$f \circ \phi (q_{t+1}) = f \circ \phi (q_t) + \mathcal {J}(q_{t+1} - q_t)$$, we can apply the IK equation (Toussaint 2011):

\begin{aligned}
q_{t+1} &= q_t - \delta \mathcal{J}^{\sharp}\\
\mathcal{J}^{\sharp} &= \left( \varrho \mathcal{J}^T\mathcal{J} + \mathbb{I} \right)^{-1} \mathcal{J}^T \varrho \\
&= \mathcal{J}^T \left( \mathcal{J}\mathcal{J}^T + \varrho^{-1}\right)^{-1} = \frac{1}{||\mathcal{J}||^2}\mathcal{J}^T
\end{aligned}

We have used the Woodbury identity and the fact that $$\mathcal {J}\mathcal {J}^T = ||\mathcal {J}||^2$$ in the case of a 1-dimensional task variable $$y$$ (the Jacobian is then a row-vector gradient). $$\mathcal {J^{\sharp }}$$ is called the pseudoinverse of $$\mathcal {J}$$. Thus, the steps generated by our motion model are proportional to $$\mathcal {J}$$ times a negative scalar.$$\square$$
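The identity above can be checked numerically: the damped pseudoinverse $$\left(\varrho \mathcal{J}^T\mathcal{J} + \mathbb{I}\right)^{-1}\mathcal{J}^T\varrho$$ approaches $$\mathcal{J}^T/||\mathcal{J}||^2$$ as $$\varrho$$ grows. A minimal NumPy sketch, with an illustrative gradient vector that is not taken from the paper:

```python
import numpy as np

# Row-vector gradient J of the scalar value function f∘φ at q_t
# (the values are illustrative, not from the paper).
J = np.array([[0.5, -1.2, 0.3]])  # shape (1, n) for a 1-D task variable

def regularized_pinv(J, rho):
    """Damped pseudoinverse (ϱ JᵀJ + I)⁻¹ Jᵀ ϱ arising from the IK solution."""
    n = J.shape[1]
    return np.linalg.inv(rho * J.T @ J + np.eye(n)) @ J.T * rho

# Limit claimed by Proposition 1 for a 1-D task variable: Jᵀ / ||J||²
limit = J.T / np.linalg.norm(J) ** 2

# The approximation error vanishes as ϱ → ∞.
errs = [np.linalg.norm(regularized_pinv(J, rho) - limit)
        for rho in (1e0, 1e3, 1e6)]
```

By the push-through identity, `regularized_pinv(J, rho)` equals $$\mathcal{J}^T(\mathcal{J}\mathcal{J}^T + \varrho^{-1})^{-1}$$ exactly, so the error shrinks like $$\varrho^{-1}$$.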

### 1.2 Proof of Proposition 2 for Lyapunov attractor properties of TRIC

Proposition 2 Suppose we have trained TRIC on a single trajectory $$\{q_t, y_t\}_{t=1}^T$$ and that $$f \circ \phi (q_T)$$ is a minimum of the value function. Additionally, we generate motion with $$\varrho \rightarrow \infty$$, i.e. very high weighting of the value function. Then the motion generated by the model in Eq. (8) fulfills the conditions of Theorem 1 and is thus asymptotically stable at the attractor subspace $$Q' = \{q': \phi (q') = \phi (q_T) = y_T \}$$.

### Proof

Because $$\varrho \rightarrow \infty$$, the term decreasing the value function $$f$$ dominates the motion equation and the effect of the other terms can be ignored. Construct $$V(q) = f\circ \phi (q) - c_T$$, where $$c_T = f\circ \phi (q_T)$$. Then the Lyapunov stability conditions hold:

• (a) holds directly because of the assumption that $$c_T= f\circ \phi (q_T)$$ is a minimum, so any other joint state $$q$$ s.t. $$\phi (q) \ne \phi (q_T)$$ has a higher value $$f\circ \phi (q)$$.

• (b) holds by the construction of $$V(q)$$, which directly implies

\begin{aligned} V(q_T) = f\circ \phi (q_T) - c_T = 0 \end{aligned}

• Proposition 1 holds because we assumed $$\varrho \rightarrow \infty$$, so the steps of the motion model are proportional to the gradient $$\mathcal {J}$$. The motion model of Eq. (9) therefore decreases the value $$f\circ \phi (q_t)$$ monotonically and (c) holds.

• (d) holds because $$c_T= f\circ \phi (q_T)$$ is a local minimum: once we reach a joint state $$q'$$ s.t. $$\phi (q') = \phi (q_T)$$, the gradient of the value function is 0 and no further decrease is possible.$$\square$$
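The Lyapunov argument can be illustrated with a toy motion model. Here both the task map $$\phi$$ and the value function $$f$$ are hand-crafted stand-ins for the quantities TRIC would learn (the choices below are hypothetical, not from the paper); taking steps of the form derived in Proposition 1 decreases $$V = f\circ\phi$$ monotonically until the attractor value is reached within tolerance $$\delta$$:

```python
import numpy as np

# Hypothetical scalar task map φ and value function f, minimal at y_T.
y_T = 1.0
phi = lambda q: q[0] ** 2 + q[1]
f = lambda y: (y - y_T) ** 2

def grad(q, eps=1e-6):
    """Numerical gradient J of f∘φ at q (central differences)."""
    g = np.zeros_like(q)
    for i in range(len(q)):
        dq = np.zeros_like(q); dq[i] = eps
        g[i] = (f(phi(q + dq)) - f(phi(q - dq))) / (2 * eps)
    return g

q = np.array([2.0, 0.5])
delta = 0.1
V = [f(phi(q))]                  # Lyapunov values along the trajectory
for _ in range(500):
    if V[-1] <= delta:           # close enough to the attractor subspace
        break
    J = grad(q)
    q = q - delta * J / np.linalg.norm(J) ** 2   # step of Proposition 1
    V.append(f(phi(q)))
```

Each step reduces the value by roughly $$\delta$$ (to first order), so the trajectory descends toward the subspace where $$\phi(q) = y_T$$, mirroring conditions (c) and (d) above.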


Jetchev, N., Toussaint, M. Discovering relevant task spaces using inverse feedback control. Auton Robot 37, 169–189 (2014). https://doi.org/10.1007/s10514-014-9384-1