Abstract
This paper explores the problem of formalizing the development of autonomous artificial intelligence systems (AAISs) whose mathematical models may be complex or non-identifiable. Using the value iteration method for Q-functions of rewards, we develop a methodology for constructing ε-optimal strategies with a prescribed accuracy. The results make it possible to outline classes of AAISs (including dual-use systems) for which the construction of optimal and ε-optimal strategies can be rigorously justified even when the models are identifiable but the computational complexity of standard dynamic programming algorithms is not strongly polynomial.
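The value iteration scheme for Q-functions mentioned above can be illustrated on a toy example. The following sketch (a minimal illustration, not the paper's construction; the MDP, discount factor, and stopping threshold are assumptions chosen for demonstration) iterates the Bellman update Q_{k+1}(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q_k(s',a') and stops once the sup-norm change is below ε(1−γ)/(2γ), a standard criterion guaranteeing that the greedy policy with respect to the final Q is ε-optimal:

```python
# Toy MDP with 2 states and 2 actions (illustrative values only).
# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.5}}
GAMMA = 0.9  # discount factor

def q_value_iteration(eps):
    """Run value iteration on Q-functions until the greedy policy is eps-optimal."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    # Standard stopping rule: ||Q_{k+1} - Q_k||_sup < eps * (1 - gamma) / (2 * gamma)
    threshold = eps * (1 - GAMMA) / (2 * GAMMA)
    while True:
        # Bellman update: Q(s,a) = R(s,a) + gamma * E[max_a' Q(s', a')]
        Q_new = {
            s: {
                a: R[s][a] + GAMMA * sum(p * max(Q[s2].values()) for s2, p in P[s][a])
                for a in P[s]
            }
            for s in P
        }
        delta = max(abs(Q_new[s][a] - Q[s][a]) for s in P for a in P[s])
        Q = Q_new
        if delta < threshold:
            break
    # Greedy (eps-optimal) stationary strategy: pick the argmax action in each state.
    policy = {s: max(Q[s], key=Q[s].get) for s in P}
    return Q, policy

Q, policy = q_value_iteration(eps=1e-3)
print(policy)  # maps each state to its eps-optimal action
```

Note that the number of iterations needed grows as γ approaches 1, which is the computational-complexity caveat the abstract refers to: value iteration is not strongly polynomial in general.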
Additional information
Translated from Kibernetyka ta Systemnyi Analiz, No. 5, September–October, 2023, pp. 89–99.
About this article
Cite this article
Zgurovsky, M.Z., Kasyanov, P.O. & Levenchuk, L.B. Formalization of Methods for the Development of Autonomous Artificial Intelligence Systems. Cybern Syst Anal 59, 763–771 (2023). https://doi.org/10.1007/s10559-023-00612-z