Learning the optimal state-feedback via supervised imitation learning

Abstract

Imitation learning is a control design paradigm that seeks to learn a control policy reproducing demonstrations from expert agents. By replacing expert demonstrations with optimal behaviours, the same paradigm leads to the design of control policies that closely approximate the optimal state-feedback. This approach requires training a machine learning algorithm (in our case deep neural networks) directly on state-control pairs originating from optimal trajectories. We have shown in previous work that, when restricted to low-dimensional state and control spaces, this approach is very successful in several deterministic, non-linear problems in continuous time. In this work, we refine our previous studies using as a test case a simple quadcopter model with quadratic and time-optimal objective functions. We describe in detail the best learning pipeline we have developed, which is able to approximate the state-feedback map via deep neural networks to a very high accuracy. We introduce the use of the softplus activation function in the hidden units of neural networks, showing that it results in a smoother control profile whilst retaining the benefits of rectifiers. We show how to evaluate the optimality of the trained state-feedback, and find that already with two layers the objective function reached and its optimal value differ by less than one percent. We also consider an additional metric linked to the system's asymptotic behaviour: the time taken to converge to the policy’s fixed point. With respect to these metrics, we show that improvements in the mean absolute error do not necessarily correspond to better policies.
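As a rough illustration of two ideas mentioned in the abstract, the sketch below compares the softplus activation with a rectifier and computes a relative gap between the objective reached by a policy and its optimal value. It is a minimal sketch and not the authors' pipeline: the function names (`softplus`, `softplus_grad`, `relu_grad`, `relative_objective_gap`) and the sample objective values are hypothetical, and the gap formula is one plausible way to express "differ by less than one percent".

```python
import math


def relu(x: float) -> float:
    """Rectifier: max(x, 0)."""
    return max(x, 0.0)


def softplus(x: float) -> float:
    """Softplus log(1 + exp(x)), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu_grad(x: float) -> float:
    """Derivative of the rectifier; it jumps from 0 to 1 at x = 0."""
    return 1.0 if x > 0.0 else 0.0


def softplus_grad(x: float) -> float:
    """Derivative of softplus, i.e. the logistic sigmoid (continuous everywhere)."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relative_objective_gap(j_policy: float, j_optimal: float) -> float:
    """Relative difference between the objective reached and the optimal one.

    Assumption: the optimal value is non-zero; this formula is illustrative.
    """
    return abs(j_policy - j_optimal) / abs(j_optimal)


if __name__ == "__main__":
    # Near zero the rectifier's slope is discontinuous, the softplus slope is not.
    for x in (-1e-3, 1e-3):
        print(x, relu_grad(x), softplus_grad(x))
    # Far from zero, softplus behaves like the rectifier.
    print(softplus(20.0) - relu(20.0))
    # Hypothetical objective values, for illustration only.
    print(relative_objective_gap(1.005, 1.0))
```

The smooth slope of softplus around zero is what would be expected to yield a smoother control profile from the network, while for large inputs it stays close to the rectifier.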

Author information

Corresponding author

Correspondence to Dario Izzo.

Additional information

Dharmesh Tailor holds a bachelor's degree in mathematics and computer science from Imperial College London (United Kingdom) and a master's degree in artificial intelligence from the University of Edinburgh (United Kingdom). Following his studies, he joined the European Space Agency as a Young Graduate Trainee in the Advanced Concepts Team, where his research focused on machine learning techniques for optimal control. He currently works at the RIKEN Center for AI Project (Japan) in the Approximate Bayesian Inference Team, researching reinforcement learning and probabilistic inference.

Dario Izzo graduated as a doctor of aeronautical engineering from the University Sapienza of Rome (Italy). He then took a second master's degree in “satellite platforms” at the University of Cranfield in the United Kingdom and completed his Ph.D. degree in mathematical modelling at the University Sapienza of Rome, where he lectured on classical mechanics and space flight mechanics. Dario Izzo later joined the European Space Agency and became the scientific coordinator of its Advanced Concepts Team. He devised and managed the Global Trajectory Optimization Competition events, ESA’s Summer of Code in Space, and the Kelvins innovation and competition platform for space problems. He has published more than 170 papers in international journals and conferences, making key contributions to the understanding of flight mechanics and spacecraft control and pioneering techniques based on evolutionary and machine learning approaches. Dario Izzo received the Humies Gold Medal and led the team that won the 8th edition of the Global Trajectory Optimization Competition.

Cite this article

Tailor, D., Izzo, D. Learning the optimal state-feedback via supervised imitation learning. Astrodyn 3, 361–374 (2019). https://doi.org/10.1007/s42064-019-0054-0

Keywords

  • optimal control
  • deep learning
  • imitation learning
  • G&CNET