
Twin Deterministic Policy Gradient Adaptive Dynamic Programming for Optimal Control of Affine Nonlinear Discrete-time Systems

  • Regular Paper
  • Intelligent Control and Applications
  • Published:
International Journal of Control, Automation and Systems

Abstract

Recent achievements in the field of adaptive dynamic programming (ADP), together with the data resources and computational capabilities of modern control systems, have led to growing interest in learning-based and data-driven control technologies. This paper proposes a twin deterministic policy gradient adaptive dynamic programming (TDPGADP) algorithm to solve the optimal control problem for discrete-time affine nonlinear systems in a model-free scenario. To mitigate the overestimation problem resulting from function approximation errors, the minimum of the two Q-network estimates is used to update the control policy. The convergence of the proposed algorithm, in which the value function serves as the Lyapunov function, is verified. By designing a twin actor-critic network structure and combining target networks with a specially designed adaptive experience replay mechanism, the algorithm is convenient to implement and the sample efficiency of the learning process is improved. Two simulation examples verify the efficacy of the proposed method.
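To make the twin-critic idea concrete, the following is a minimal sketch of a clipped double-Q update of the kind the abstract describes: two critics are regressed toward a common target built from the smaller of the two target-critic estimates, and the actor is updated by the deterministic policy gradient through one critic. Everything here is illustrative; the PyTorch setting, the `Critic` and `Actor` architectures, the replay-batch format, and all hyperparameters are assumptions rather than the networks or the adaptive experience replay mechanism used in the paper, and the update is written in the standard reward-maximization form rather than the paper's discrete-time cost formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative network shapes; not the architectures used in the paper.
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, u):
        # Q(x, u): scalar value estimate for a state-action pair
        return self.net(torch.cat([x, u], dim=-1))

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, x):
        # Deterministic policy u = mu(x)
        return self.net(x)

def twin_update(batch, actor, critics, target_actor, target_critics,
                actor_opt, critic_opts, gamma=0.99, tau=0.005):
    """One clipped double-Q update on a replay minibatch (x, u, r, x_next)."""
    x, u, r, x_next = batch

    with torch.no_grad():
        u_next = target_actor(x_next)
        # Take the minimum of the two target-critic estimates to counter
        # overestimation caused by function approximation errors.
        q_next = torch.min(target_critics[0](x_next, u_next),
                           target_critics[1](x_next, u_next))
        target = r + gamma * q_next

    # Both critics regress toward the same clipped target.
    for critic, opt in zip(critics, critic_opts):
        loss = F.mse_loss(critic(x, u), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Deterministic policy gradient: maximize Q1(x, mu(x)).
    actor_loss = -critics[0](x, actor(x)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks.
    with torch.no_grad():
        for net, tgt in zip(list(critics) + [actor], list(target_critics) + [target_actor]):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.mul_(1 - tau).add_(tau * p)
```

In this sketch the minibatch would be drawn from a replay buffer; the paper's adaptive experience replay mechanism, which governs how such minibatches are sampled, is not reproduced here.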



Author information


Corresponding author

Correspondence to Jingcheng Wang.

Additional information

This work was supported by the National Natural Science Foundation of China (61633019).

Jiahui Xu received his B.S. degree from the School of Electrical and Automation Engineering, Jilin University, Changchun, China, in 2019. He is currently pursuing a Ph.D. degree with the Department of Automation, Shanghai Jiao Tong University, Shanghai, China (2019-). His current research interests include reinforcement learning and optimal control.

Jingcheng Wang received his B.S. and M.S. degrees from Northwestern Polytechnical University, Xi’an, China, in 1992 and 1995, respectively, and a Ph.D. degree from Zhejiang University, Hangzhou, China, in 1998. He was formerly a Research Fellow of the Alexander von Humboldt Foundation at Rostock University, Rostock, Germany, and is currently a Professor with Shanghai Jiao Tong University, Shanghai, China. His current research interests include robust control, intelligent control, and real-time control and simulation.

Jun Rao received his B.S. degree in electrical engineering and automation from Northwest A&F University, Yangling, China, in 2020. He is currently pursuing a Ph.D. degree at the School of Automation, Shanghai Jiao Tong University, Shanghai, China. His current research interests include neural networks and reinforcement learning.

Yanjiu Zhong received his M.S. degree from the School of Mechanical Engineering, Hefei University of Technology, Hefei, China, in 2019. He is currently pursuing a Ph.D. degree at the School of Automation, Shanghai Jiao Tong University, Shanghai, China. His major research areas are non-convex optimization and reinforcement learning.

Shangwei Zhao received his B.S. degree in automation from Northeastern University, Shenyang, China, in 2018. He is currently pursuing a Ph.D. degree with the Department of Automation, Shanghai Jiao Tong University, Shanghai, China (2018-). His current research interests include adaptive dynamic programming, switched systems, and nonlinear system control.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Xu, J., Wang, J., Rao, J. et al. Twin Deterministic Policy Gradient Adaptive Dynamic Programming for Optimal Control of Affine Nonlinear Discrete-time Systems. Int. J. Control Autom. Syst. 20, 3098–3109 (2022). https://doi.org/10.1007/s12555-021-0473-6

