Imitation learning is a control design paradigm that seeks to learn a control policy reproducing demonstrations from expert agents. By using optimal behaviours as the expert demonstrations, the same paradigm yields control policies that closely approximate the optimal state-feedback. This approach requires training a machine learning algorithm (in our case, deep neural networks) directly on state-control pairs originating from optimal trajectories. We have shown in previous work that, when restricted to low-dimensional state and control spaces, this approach is very successful in several deterministic, non-linear, continuous-time problems. In this work, we refine our previous studies using a simple quadcopter model with quadratic and time-optimal objective functions as a test case. We describe in detail the best learning pipeline we have developed, which is able to approximate the state-feedback map via deep neural networks to very high accuracy. We introduce the use of the softplus activation function in the hidden units of the neural networks, showing that it results in a smoother control profile whilst retaining the benefits of rectifiers. We show how to evaluate the optimality of the trained state-feedback, finding that, already with two layers, the objective value reached differs from its optimal value by less than one percent. We then also consider an additional metric linked to the system's asymptotic behaviour: the time taken to converge to the policy's fixed point. With respect to these metrics, we show that improvements in the mean absolute error do not necessarily correspond to better policies.
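The core idea above, regressing a network with softplus hidden units on state-control pairs drawn from optimal trajectories, can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the dynamics, the synthetic target control law, the network width, and the training setup below are all illustrative assumptions, with optimal trajectories replaced by a simple analytic stand-in.

```python
import numpy as np

def softplus(x):
    # Smooth rectifier log(1 + e^x): differentiable everywhere, unlike
    # ReLU, which is the property linked to smoother control profiles.
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # Derivative of softplus is the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical stand-in for optimal state-control pairs: 2-D states x
# and a scalar control u(x) = sin(x1) + 0.5 x2, playing the role of
# samples from precomputed optimal trajectories.
X = rng.uniform(-1.0, 1.0, size=(512, 2))
U = (np.sin(X[:, 0]) + 0.5 * X[:, 1]).reshape(-1, 1)

# One hidden layer of softplus units with a linear output head.
W1 = rng.normal(0.0, 0.5, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.5, (32, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(2000):
    Z = X @ W1 + b1              # pre-activations
    H = softplus(Z)              # hidden features
    pred = H @ W2 + b2           # predicted control
    err = pred - U               # gradient of 0.5 * mean squared error
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * softplus_grad(Z)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Mean absolute error of the learned state-feedback on the training set.
mae = np.abs((softplus(X @ W1 + b1) @ W2 + b2) - U).mean()
```

As the abstract notes, a low mean absolute error on such pairs is only a proxy: judging the resulting policy properly requires closed-loop metrics such as the objective value attained and the time to converge to the policy's fixed point.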
Dharmesh Tailor received his bachelor's degree in mathematics and computer science from Imperial College London (United Kingdom) and his master's degree in artificial intelligence from the University of Edinburgh (United Kingdom). Following his studies, he joined the European Space Agency's Advanced Concepts Team as a Young Graduate Trainee, where his research focused on machine learning techniques for optimal control. He currently works in the Approximate Bayesian Inference Team at the RIKEN Center for AI Project (Japan), researching reinforcement learning and probabilistic inference.
Dario Izzo graduated in aeronautical engineering from the Sapienza University of Rome (Italy). He then took a second master's degree in satellite platforms at Cranfield University (United Kingdom) and completed his Ph.D. in mathematical modelling at the Sapienza University of Rome, where he lectured on classical mechanics and space flight mechanics. He later joined the European Space Agency and became the scientific coordinator of its Advanced Concepts Team. He devised and managed the Global Trajectory Optimization Competitions, ESA's Summer of Code in Space, and the Kelvins innovation and competition platform for space problems. He has published more than 170 papers in international journals and conferences, making key contributions to the understanding of flight mechanics and spacecraft control and pioneering techniques based on evolutionary and machine learning approaches. He received the Humies Gold Medal and led the team that won the 8th edition of the Global Trajectory Optimization Competition.
Cite this article
Tailor, D., Izzo, D. Learning the optimal state-feedback via supervised imitation learning. Astrodyn 3, 361–374 (2019). https://doi.org/10.1007/s42064-019-0054-0
- optimal control
- deep learning
- imitation learning