Abstract
A decision-making (DM) agent models its environment and quantifies its DM preferences. An adaptive agent models them locally, near the realisation of the behaviour of the closed DM loop. Due to this locality, a simple tool set often suffices for solving complex dynamic DM tasks. The inspected Bayesian agent relies on a unified learning and optimisation framework, which works well when tailored by making a range of case-specific options. Many of them can be made off-line; these concern the sets of involved variables, the knowledge and preference elicitation, structure estimation, etc. Some meta-parameters, however, require an on-line choice: for instance, a weight balancing exploration with exploitation, a weight reflecting the agent’s willingness to cooperate, or a discounting factor. Such options influence DM quality, often vitally, and their adaptive tuning is needed. Specific solutions exist, for instance, a data-dependent choice of a forgetting factor serving the tracking of parameter changes, but a general methodology is missing. The paper opens a pathway to it. The solution uses a hierarchical feedback exploiting a generic, DM-related, observable mismodelling indicator. The paper presents and justifies the theoretical concept, and outlines and illustrates its use.
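To hint at the mechanism before the formal development, the following minimal sketch (ours, not the paper's algorithm) tunes one such meta-parameter, a forgetting factor, by an outer feedback that scores each candidate value with an observable mismodelling indicator, here the one-step-ahead predictive log-likelihood. The scalar Gaussian model, the candidate grid and all names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def predictive_score(y, lam, r=1.0):
    """Track a drifting parameter by Bayesian (Kalman-type) updating with
    exponential forgetting lam; return the accumulated one-step-ahead
    predictive log-likelihood, used as the mismodelling indicator."""
    m, p = 0.0, 10.0                  # prior mean and variance
    score = 0.0
    for yt in y:
        p = p / lam                   # forgetting inflates uncertainty
        s = p + r                     # one-step-ahead predictive variance
        score += -0.5 * (np.log(2 * np.pi * s) + (yt - m) ** 2 / s)
        k = p / s                     # Bayesian data update
        m += k * (yt - m)
        p *= 1.0 - k
    return score

# Simulated observations: the tracked parameter drifts, so some
# forgetting (lam < 1) should explain the realised data better.
theta = np.cumsum(0.1 * rng.standard_normal(500))
y = theta + rng.standard_normal(500)

# Outer (meta) feedback: choose the meta-parameter value whose
# predictor best explains the observed closed-loop behaviour.
candidates = [0.90, 0.95, 0.99, 1.00]
scores = {lam: predictive_score(y, lam) for lam in candidates}
print("chosen forgetting factor:", max(scores, key=scores.get))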
Notes
The prefix “meta” marks a task about a task, DM about DM, an option about an option, etc. Note that all abbreviations are summarised in Table 2 at the end of the paper.
The agent’s prior knowledge \(k^{0}\) implicitly conditions all pds involved. The knowledge \(k^{t}\) is also called the information state. The sequence \((o_{t},a_{t})_{t\in\varvec{\{}t\varvec{\}}}\) is often referred to as the (closed DM loop) trajectory or the observed behaviour.
KLD, formerly called cross-entropy, Kullback and Leibler [67], and now relative entropy, is the DM-rules-dependent expectation of the loss \(\ln(\mathsf{j}/\mathsf{j}_{\mathfrak{i}})\).
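For concreteness, in the note's symbols (writing \(b\) for a generic closed-loop behaviour is our notational assumption), the divergence reads \(\mathrm{KLD}(\mathsf{j}\,\Vert\,\mathsf{j}_{\mathfrak{i}})\equiv\mathsf{E}_{\mathsf{j}}[\ln(\mathsf{j}/\mathsf{j}_{\mathfrak{i}})]=\int\mathsf{j}(b)\ln\big(\mathsf{j}(b)/\mathsf{j}_{\mathfrak{i}}(b)\big)\,\mathrm{d}b\ge 0\), with equality iff \(\mathsf{j}=\mathsf{j}_{\mathfrak{i}}\) almost everywhere.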
The usual MDP deals with the reward \(-\mathsf{L}\) and maximises its expectation.
The same choice is faced when dealing with the usual exploration techniques, Ouyang et al. [53].
The term trust has a narrower meaning here than in the numerous studies focused on it, Li and Song [47].
Extensive references on the whole approach can be found in the cited paper. The chapter by Dietrich and List [10] is a good starting point for the pooling problems that are at the core of such cooperation.
In this context, Shannon’s sampling theorem, Shannon [66], provides no guide.
The dependence of pds on the horizon h is made explicit here.
For a pd \(\mathsf{s}\) on \(\varvec{\{}x\varvec{\}}\), its support is \(\mathrm{supp}[\mathsf{s}]\equiv\{x\in\varvec{\{}x\varvec{\}}:\,\mathsf{s}(x)>0\}\).
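A minimal numerical illustration for a discrete pd (the probability vector and the NumPy usage are ours, purely for concreteness):

import numpy as np

# A pd s on a finite set {x}, stored as a probability vector (hypothetical values).
s = np.array([0.5, 0.0, 0.3, 0.2])

# supp[s] = {x in {x} : s(x) > 0}: the atoms carrying positive probability mass.
support = np.flatnonzero(s > 0)
print(support)  # -> [0 2 3]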
References
Algoet P, Cover T (1988) A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann Probab 16:899–909
Åström K, Wittenmark B (1994) Adaptive control, 2nd edn. Addison-Wesley, New York
Beckenbach L, Osinenko P, Streif S (2020) A Q-learning predictive control scheme with guaranteed stability. Eur J Control 56:167–178
Berec L, Kárný M (1997) Identification of reality in Bayesian context. In: Kárný M, Warwick K (eds) Computer-intensive methods in control and signal processing. Birkhäuser, Basel, pp 181–193
Berger J (1985) Statistical decision theory and Bayesian analysis. Springer, Berlin
Bernardo J (1979) Expected information as expected utility. Ann Stat 7:686–690
Bertsekas D (2017) Dynamic programming and optimal control. Athena Scientific, Nashua
Bogdan P, Pedram M (2018) Toward enabling automated cognition and decision-making in complex cyber-physical systems. In: 2018 IEEE ISCAS, pp 1–4
Diebold F, Shin M (2019) Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives. Int J Forecast 35:1679–1691
Dietrich F, List C (2016) Probabilistic opinion pooling. In: Hitchcock C, Hajek A (eds) Oxford handbook of philosophy and probability. Oxford University Press, Oxford
Doob J (1953) Stochastic processes. Wiley, Hoboken
Doyle J (2013) Survey of time preference, delay discounting models. Judgm Decis Mak 8:116–135
Duvenaud D (2014) Automatic model construction with Gaussian processes. PhD thesis, Pembroke College, University of Cambridge
Feldbaum A (1961) Theory of dual control. Autom Remote Control 22:3–19
Gaitsgory V, Grüne L, Höger M, Kellett C, Weller S (2018) Stabilization of strictly dissipative discrete time systems with discounted optimal control. Automatica 93:311–320. https://doi.org/10.1016/j.automatica.2018.03.076
Ghavamzadeh M, Mannor S, Pineau J, Tamar A (2015) Bayesian reinforcement learning: a survey. Found Trends Mach Learn 8(5–6):359–483. https://doi.org/10.1561/2200000049
Grünwald P, Langford J (2007) Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach Learn 66(2–3):119–149
Guan P, Raginsky M, Willett R (2014) Online Markov decision processes with Kullback–Leibler control cost. IEEE Trans Autom Control 59(6):1423–1438
Guy TV, Kárný M (2000) Design of an adaptive controller of LQG type: spline-based approach. Kybernetika 36(2):255–262
Hebb D (2005) The organization of behavior: a neuropsychological theory. Taylor & Francis. https://books.google.cz/books?id=uyV5AgAAQBAJ. Accessed 15 Dec 2019
Hospedales T, Antoniou A, Micaelli P, Storkey A (2020) Meta-learning in neural networks: a survey. arXiv:2004.05439v1 [cs.LG]. Accessed 11 Apr 2020
Ishii S, Yoshida W, Yoshimoto J (2002) Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Netw 15(4–6):665–687
Jacobs O, Patchell J (1972) Caution and probing in stochastic control. Int J Control 16(1):189–199
Jazwinski A (1970) Stochastic processes and filtering theory. Academic Press, New York
Kandasamy K, Schneider J, Póczos B (2015) High dimensional Bayesian optimisation and bandits via additive models. In: Proceedings of the 32nd International Conference on Machine Learning, PMLR, vol 37
Kárný M (1991) Estimation of control period for selftuners. Automatica 27(2):339–348 (extended version of a paper presented at the 11th IFAC World Congress, Tallinn)
Kárný M (1996) Towards fully probabilistic control design. Automatica 32(12):1719–1722
Kárný M (2020) Axiomatisation of fully probabilistic design revisited. Syst Control Lett. https://doi.org/10.1016/j.sysconle.2020.104719
Kárný M (2020) Minimum expected relative entropy principle. In: Proceedings of the 18th ECC, IFAC, St. Petersburg, pp 35–40
Kárný M, Alizadeh Z (2019) Towards fully probabilistic cooperative decision making. In: Slavkovik M (ed) Multi-agent systems, EUMAS 2018, vol LNAI 11450. Springer Nature, Dordrecht, pp 1–16
Kárný M, Guy T (2012) On support of imperfect Bayesian participants. In: Guy T et al (eds) Decision making with imperfect decision makers, Intelligent Systems Reference Library, vol 28. Springer, Berlin, pp 29–56
Kárný M, Guy T (2019) Preference elicitation within framework of fully probabilistic design of decision strategies. In: IFAC International Workshop on Adaptive and Learning Control Systems, vol 52, pp 239–244
Kárný M, Hůla F (2019) Balancing exploitation and exploration via fully probabilistic design of decision policies. In: Proceedings of the 11th International Conference on Agents and Artificial Intelligence: ICAART, vol 2, pp 857–864
Kárný M, Kroupa T (2012) Axiomatisation of fully probabilistic design. Inf Sci 186(1):105–113
Kárný M, Halousková A, Böhm J, Kulhavý R, Nedoma P (1985) Design of linear quadratic adaptive control: theory and algorithms for practice. Kybernetika 21(supp. Nos 3–6):1–96
Kárný M, Böhm J, Guy T, Jirsa L, Nagy I, Nedoma P, Tesař L (2006) Optimized Bayesian dynamic advising: theory and algorithms. Springer, London
Kárný M, Bodini A, Guy T, Kracík J, Nedoma P, Ruggeri F (2014) Fully probabilistic knowledge expression and incorporation. Stat Interface 7(4):503–515
Klenske E, Hennig P (2016) Dual control for approximate Bayesian reinforcement learning. J Mach Learn Res 17:1–30
Kober J, Peters J (2011) Policy search for motor primitives in robotics. Mach Learn 84(1):171–203. https://doi.org/10.1007/s10994-010-5223-6
Kracík J, Kárný M (2005) Merging of data knowledge in Bayesian estimation. In: Filipe J et al (eds) Proceedings of the 2nd International Conference on informatics in control, automation and robotics, Barcelona, pp 229–232
Kulhavý R, Zarrop MB (1993) On a general concept of forgetting. Int J Control 58(4):905–924
Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22:79–87
Kumar EV, Jerome J, Srikanth K (2014) Algebraic approach for selecting the weighting matrices of linear quadratic regulator. In: 2014 International Conference on green computing communication and electrical engineering (ICGCCEE), pp 1–6. https://doi.org/10.1109/ICGCCEE.2014.6922382
Kumar P (1985) A survey of some results in stochastic adaptive control. SIAM J Control Optim 23:399–409
Larsson D, Braun D, Tsiotras P (2017) Hierarchical state abstractions for decision-making problems with computational constraints. arXiv:1710.07990v1 [cs.AI]. Accessed 22 Oct 2017
Lee K, Kim G, Ortega P, Lee D, Kim K (2019) Bayesian optimistic Kullback-Leibler exploration. Mach Learn 108(5):765–783. https://doi.org/10.1007/s10994-018-5767-4
Li W, Song H (2016) ART: an attack-resistant trust management scheme for securing vehicular ad hoc networks. IEEE Trans Intell Transport Syst 17:960–969
Liao Y, Deschamps F, Loures E, Ramos L (2017) Past, present and future of industry 4.0—a systematic literature review and research agenda proposal. Int J Prod Res 55(12):3609–3629
Mayne D (2014) Model predictive control: recent developments and future promise. Automatica 50:2967–2986
Meditch J (1969) Stochastic optimal linear estimation and control. McGraw Hill, New York
Mesbah A (2018) Stochastic model predictive control with active uncertainty learning: a survey on dual control. Ann Rev Control 45:107–117. https://doi.org/10.1016/j.arcontrol.2017.11.001
Moerland TM, Broekens J, Jonker CM (2018) Emotion in reinforcement learning agents and robots: a survey. Mach Learn 107(2):443–480. https://doi.org/10.1007/s10994-017-5666-0
Ouyang Y, Gagrani M, Nayyar A, Jain R (2017) Learning unknown Markov decision processes: a Thompson sampling approach. In: von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R (eds) Advances in neural information processing systems 30. Curran Associates, Inc., pp 1333–1342
Peterka V (1972) On steady-state minimum variance control strategy. Kybernetika 8:219–231
Peterka V (1975) A square-root filter for real-time multivariable regression. Kybernetika 11:53–67
Peterka V (1981) Bayesian system identification. In: Eykhoff P (ed) Trends and progress in system identification. Pergamon Press, Oxford, pp 239–304
Peterka V (1991) Adaptation of LQG control design to engineering needs. In: Warwick K, Kárný M, Halousková A (eds) Lecture notes: advanced methods in adaptive control for industrial application; Joint UK-CS seminar, vol 158. Springer-Verlag, New York
Peterka V, Åström K (1973) Control of multivariable systems with unknown but constant parameters. In: Preprints of the 3rd IFAC Symposium on Identification and Process Parameter Estimation, IFAC, The Hague/Delft, pp 534–544
Puterman M (2005) Markov decision processes: discrete stochastic dynamic programming. Wiley, Hoboken
Quinn A, Kárný M, Guy T (2016) Fully probabilistic design of hierarchical Bayesian models. Inf Sci 369:532–547
Rao M (1987) Measure theory and integration. Wiley, Hoboken
Rohrs C, Valavani L, Athans M, Stein G (1982) Robustness of adaptive control algorithms in the presence of unmodeled dynamics. In: IEEE Conference on Decision and Control, Orlando, FL, vol 1, pp 3–11
Sandholm T (1999) Distributed rational decision making. In: Weiss G (ed) Multiagent systems—a modern approach to distributed artificial intelligence. MIT Press, Cambridge, pp 201–258
Savage L (1954) Foundations of statistics. Wiley, Hoboken
Schweighofer N, Doya K (2003) Meta-learning in reinforcement learning. Neural Netw 16(1):5–9. https://doi.org/10.1016/S0893-6080(02)00228-9
Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Shore J, Johnson R (1980) Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans Inf Theory 26(1):26–37
Si J, Barto A, Powell W, Wunsch D (eds) (2004) Handbook of learning and approximate dynamic programming. Wiley-IEEE Press, Hoboken
Tanner M (1993) Tools for statistical inference. Springer Verlag, New York
Tao G (2014) Multivariable adaptive control: a survey. Automatica 50(11):2737–2764
Ullrich M (1964) Optimum control of some stochastic systems. In: Preprints of the VIIIth Conference ETAN, Beograd
Wolpert D, Macready W (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82
Wu H, Guo X, Liu X (2017) Adaptive exploration-exploitation trade off for opportunistic bandits. Preprint at arXiv:1709.04004
Yang Z, Wang C, Zhang Z, Li J (2019) Mini-batch algorithms with online step size. Knowl-Based Syst 165:228–240
Ethics declarations
Conflicts of interest
The author has no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript. This manuscript has not been submitted to, nor is it under review at, another journal or other publishing venue.
Funding
The reported research has been supported by MŠMT ČR LTC18075 and EU-COST Action CA16228.
Availability of data and material
Not applicable
Code availability
The source code of the example is available upon request.
Cite this article
Kárný, M. Towards on-line tuning of adaptive-agent’s multivariate meta-parameter. Int. J. Mach. Learn. & Cyber. 12, 2717–2731 (2021). https://doi.org/10.1007/s13042-021-01358-w
Keywords
- Bayesian learning
- Adaptive agent
- Meta-parameter tuning
- Fully probabilistic design
- Kullback–Leibler divergence
- Dynamic decision making