Towards on-line tuning of adaptive-agent’s multivariate meta-parameter


A decision-making (DM) agent models its environment and quantifies its DM preferences. An adaptive agent models them locally, near the realisation of the behaviour of the closed DM loop. Thanks to this, a simple tool set often suffices for solving complex dynamic DM tasks. The inspected Bayesian agent relies on a unified learning and optimisation framework, which works well when tailored through a range of case-specific options. Many of these options can be chosen off-line: they concern the sets of involved variables, knowledge and preference elicitation, structure estimation, etc. Still, some meta-parameters need an on-line choice, for instance a weight balancing exploration with exploitation, a weight reflecting the agent’s willingness to cooperate, or a discounting factor. Such options influence DM quality, often vitally, and their adaptive tuning is needed. Specific solutions exist, for instance a data-dependent choice of the forgetting factor that serves to track parameter changes. A general methodology is, however, missing. The paper opens a pathway to it. The solution uses hierarchical feedback that exploits a generic, DM-related, observable mismodelling indicator. The paper presents and justifies the theoretical concept, and outlines and illustrates its use.
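To make the notion of a forgetting factor concrete, the following minimal sketch tracks a changing mean with standard exponential forgetting. It is an illustration only, not the paper's algorithm: the function name `forgetting_mean` and the parameter `lam` are hypothetical, and the forgetting factor is fixed here, whereas the abstract refers to choosing it from data.

```python
def forgetting_mean(data, lam):
    """Running mean with exponential forgetting (illustrative sketch).

    lam in (0, 1] is the forgetting factor (hypothetical name):
    lam = 1 gives the ordinary sample mean, while smaller values
    discount old observations and track parameter changes faster.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    weight = 0.0    # effective number of remembered observations
    estimate = 0.0  # current estimate of the mean
    history = []
    for y in data:
        weight = lam * weight + 1.0          # discount old data, count the new datum
        estimate += (y - estimate) / weight  # weighted-mean update
        history.append(estimate)
    return history


# The mean jumps from 0 to 1 halfway through the record.
data = [0.0] * 50 + [1.0] * 50
no_forgetting = forgetting_mean(data, 1.0)[-1]    # averages over the whole record
with_forgetting = forgetting_mean(data, 0.8)[-1]  # follows the recent level
```

The value of `lam` trades tracking speed against noise sensitivity, which is why it acts as a meta-parameter that may need on-line tuning.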

  1. The prefix “meta” marks a task about a task, DM about DM, an option about an option, etc. Note that all abbreviations are summarised in Table 2 at the end of the paper.

  2. The agent’s prior knowledge \(k^{0}\) implicitly conditions all pds involved. The knowledge \(k^{t}\) is also called the information state. \((o_{t},a_{t})_{t\in \varvec{\{}{t} \varvec{\}}}\) is often referred to as the (closed DM loop) trajectory or the observed behaviour.

  3. KLD, formerly called cross-entropy, Kullback and Leibler [67], now relative entropy, is the DM-rules-dependent expectation of the loss \(\ln (\mathsf {j}/\mathsf {j}_{\mathfrak {i}})\).

  4. The usual MDP deals with the reward \(-\mathsf {L}\) and maximises its expectation.

  5. This reflects its interpretation as a meta-action at the upper-level feedback, cf. Fig. 1 and Sect. 5.

  6. The same choice is faced when dealing with usual exploration techniques, Ouyang et al. [53].

  7. The term trust has a narrower meaning here than in the numerous studies focused on it, e.g. Li and Song [47].

  8. This form of Bayes’ rule is valid for the considered DM rules for which the parameter pointing to the “best” model, [4], is unknown, cf. natural conditions of control in Peterka [56].

  9. Extensive references on the whole approach can be found in the cited paper. The chapter by Dietrich and List [10] is a good starting point for the pooling problems that are at the core of such cooperation.

  10. In this context, Shannon’s sampling theorem, Shannon [66], provides no guide.

  11. The dependence of pds on the horizon h is made explicit here.

  12. For a pd \(\mathsf {s}\) on \(\varvec{\{}{x} \varvec{\}}\), its support \(\mathrm {supp}[\mathsf {s}]\equiv \{x\in \varvec{\{}{x} \varvec{\}}:\,\mathsf {s}(x)>0\}\).

  13. The proof tailors and refines results in Algoet and Cover [1, 4].
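Notes 3 and 12 can be illustrated for pds on a finite set. The sketch below assumes pds are stored as dictionaries mapping outcomes to probabilities. The function names `support` and `kld` and the argument names `j` and `j_i` are illustrative choices that mirror the symbols \(\mathsf {j}\) and \(\mathsf {j}_{\mathfrak {i}}\), and the expectation is taken with respect to the first argument.

```python
import math


def support(s):
    """Support of a discrete pd s: the outcomes with positive probability."""
    return {x for x, p in s.items() if p > 0}


def kld(j, j_i):
    """Kullback-Leibler divergence of pd j from pd j_i on a finite set.

    Computes the expectation, under j, of ln(j / j_i). Returns infinity
    when j puts mass on an outcome outside the support of j_i.
    """
    if not support(j) <= support(j_i):
        return math.inf
    # Outcomes with j(x) = 0 contribute nothing (convention 0 * ln 0 = 0).
    return sum(p * math.log(p / j_i[x]) for x, p in j.items() if p > 0)


uniform = {"a": 0.5, "b": 0.5}
point = {"a": 1.0, "b": 0.0}
same = kld(uniform, uniform)      # zero: the pds coincide
narrow = kld(point, uniform)      # ln 2: point mass against a uniform pd
outside = kld(uniform, point)     # infinite: the support condition fails
```

The asymmetry between `narrow` and `outside` shows that the order of the arguments matters.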


  1. Algoet P, Cover T (1988) A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann Probab 16:899–909

  2. Åström K, Wittenmark B (1994) Adaptive control, 2nd edn. Addison-Wesley, New York

  3. Beckenbach L, Osinenko P, Streif S (2020) A Q-learning predictive control scheme with guaranteed stability. Eur J Control 56:167–178

  4. Berec L, Kárný M (1997) Identification of reality in Bayesian context. In: Kárný M, Warwick K (eds) Computer-intensive methods in control and signal processing. Birkhäuser, Basel, pp 181–193

  5. Berger J (1985) Statistical decision theory and Bayesian analysis. Springer, Berlin

  6. Bernardo J (1979) Expected information as expected utility. Ann Stat 7:686–690

  7. Bertsekas D (2017) Dynamic programming and optimal control. Athena Scientific, Nashua

  8. Bogdan P, Pedram M (2018) Toward enabling automated cognition and decision-making in complex cyber-physical systems. In: 2018 IEEE ISCAS, pp 1–4

  9. Diebold F, Shin M (2019) Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives. Int J Forecast 35:1679–1691

  10. Dietrich F, List C (2016) Probabilistic opinion pooling. In: Hitchcock C, Hajek A (eds) Oxford handbook of philosophy and probability. Oxford University Press, Oxford

  11. Doob J (1953) Stochastic processes. Wiley, Hoboken

  12. Doyle J (2013) Survey of time preference, delay discounting models. Judge Decis Mak 8:116–135

  13. Duvenaud D (2014) Automatic model construction with Gaussian processes. PhD thesis, Pembroke College, University of Cambridge

  14. Feldbaum A (1961) Theory of dual control. Autom Remote Control 22:3–19

  15. Gaitsgory V, Grüne L, Höger M, Kellett C, Weller S (2018) Stabilization of strictly dissipative discrete time systems with discounted optimal control. Automatica 93:311–320

  16. Ghavamzadeh M, Mannor S, Pineau J, Tamar A (2015) Bayesian reinforcement learning: a survey. Found Trends Mach Learn 8(5–6):359–483

  17. Grünwald P, Langford J (2007) Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach Learn 66(2–3):119–149

  18. Guan P, Raginsky M, Willett R (2014) Online Markov decision processes with Kullback Leibler control cost. IEEE Trans AC 59(6):1423–1438

  19. Guy TV, Kárný M (2000) Design of an adaptive controller of LQG type: spline-based approach. Kybernetika 36(2):255–262

  20. Hebb D (2005) The organization of behavior: a neuropsychological theory. Taylor & Francis. Accessed 15 Dec 2019

  21. Hospedales T, Antoniou A, Micaelli P, Storkey A (2020) Meta-learning in neural networks: a survey. arXiv:2004.05439v1 [cs.LG]. Accessed 11 Apr 2020

  22. Ishii S, Yoshida W, Yoshimoto J (2002) Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Netw 15(4–6):665–687

  23. Jacobs O, Patchell J (1972) Caution and probing in stochastic control. Int J Control 16(1):189–199

  24. Jazwinski A (1970) Stochastic processes and filtering theory. Ac. Press, Pleasantville

  25. Kandasamy K, Schneider J, Póczos B (2015) High dimensional Bayesian optimisation and bandits via additive models. In: International conference on machine learning, proceedings, vol 37

  26. Kárný M (1991) Estimation of control period for selftuners. Automatica 27(2):339–348 (extended version of the paper presented at the 11th IFAC World Congress, Tallinn)

  27. Kárný M (1996) Towards fully probabilistic control design. Automatica 32(12):1719–1722

  28. Kárný M (2020) Axiomatisation of fully probabilistic design revisited. Syst Con Lett.

  29. Kárný M (2020) Minimum expected relative entropy principle. In: Proceedings of the 18th ECC, IFAC, Sankt Petersburg, pp 35–40

  30. Kárný M, Alizadeh Z (2019) Towards fully probabilistic cooperative decision making. In: Slavkovik M (ed) Multi-agent systems, EUMAS 2018, vol LNAI 11450. Springer Nature, Dordrecht, pp 1–16

  31. Kárný M, Guy T (2012) On support of imperfect Bayesian participants. In: Guy T et al (eds) Decision making with imperfect decision makers, Intelligent Systems Reference Library, vol 28. Springer, Berlin, pp 29–56

  32. Kárný M, Guy T (2019) Preference elicitation within framework of fully probabilistic design of decision strategies. In: IFAC International Workshop on Adaptive and Learning Control Systems, vol 52. pp 239–244

  33. Kárný M, Hůla F (2019) Balancing exploitation and exploration via fully probabilistic design of decision policies. In: Proceedings of the 11th International Conference on Agents and Artificial Intelligence: ICAART, vol 2. pp 857–864

  34. Kárný M, Kroupa T (2012) Axiomatisation of fully probabilistic design. Inf Sci 186(1):105–113

  35. Kárný M, Halousková A, Böhm J, Kulhavý R, Nedoma P (1985) Design of linear quadratic adaptive control: theory and algorithms for practice. Kybernetika 21(supp. Nos 3–6):1–96

  36. Kárný M, Böhm J, Guy T, Jirsa L, Nagy I, Nedoma P, Tesař L (2006) Optimized Bayesian dynamic advising: theory and algorithms. Springer, London

  37. Kárný M, Bodini A, Guy T, Kracík J, Nedoma P, Ruggeri F (2014) Fully probabilistic knowledge expression and incorporation. Stat Interface 7(4):503–515

  38. Klenske E, Hennig P (2016) Dual control for approximate Bayesian reinforcement learning. J Mach Learn Res 17:1–30

  39. Kober J, Peters J (2011) Policy search for motor primitives in robotics. Mach Learn 84(1):171–203

  40. Kracík J, Kárný M (2005) Merging of data knowledge in Bayesian estimation. In: Filipe J et al (eds) Proceedings of the 2nd International Conference on informatics in control, automation and robotics, Barcelona, pp 229–232

  41. Kulhavý R, Zarrop MB (1993) On a general concept of forgetting. Int J Control 58(4):905–924

  42. Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22:79–87

  43. Kumar EV, Jerome J, Srikanth K (2014) Algebraic approach for selecting the weighting matrices of linear quadratic regulator. In: 2014 International Conference on green computing communication and electrical engineering (ICGCCEE), pp 1–6

  44. Kumar P (1985) A survey on some results in stochastic adaptive control. SIAM J Control Appl 23:399–409

  45. Larsson D, Braun D, Tsiotras P (2017) Hierarchical state abstractions for decision-making problems with computational constraints. arXiv:1710.07990v1 [cs.AI]. Accessed 22 Oct 2017

  46. Lee K, Kim G, Ortega P, Lee D, Kim K (2019) Bayesian optimistic Kullback-Leibler exploration. Mach Learn 108(5):765–783

  47. Li W, Song H (2016) ART: an attack-resistant trust management scheme for securing vehicular ad hoc networks. IEEE Trans Intell Transport Syst 17:960–969

  48. Liao Y, Deschamps F, Loures E, Ramos L (2017) Past, present and future of industry 4.0—a systematic literature review and research agenda proposal. Int J Prod Res 55(12):3609–3629

  49. Mayne D (2014) Model predictive control: recent developments and future promise. Automatica 50:2967–2986

  50. Meditch J (1969) Stochastic optimal linear estimation and control. McGraw Hill, New York

  51. Mesbah A (2018) Stochastic model predictive control with active uncertainty learning: a survey on dual control. Ann Rev Control 45:107–117

  52. Moerland TM, Broekens J, Jonker CM (2018) Emotion in reinforcement learning agents and robots: a survey. Mach Learn 107(2):443–480

  53. Ouyang Y, Gagrani M, Nayyar A, Jain R (2017) Learning unknown Markov decision processes: a Thompson sampling approach. In: von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R (eds) Advances in neural information processing systems 30. Curran Associates, Inc., pp 1333–1342

  54. Peterka V (1972) On steady-state minimum variance control strategy. Kybernetika 8:219–231

  55. Peterka V (1975) A square-root filter for real-time multivariable regression. Kybernetika 11:53–67

  56. Peterka V (1981) Bayesian system identification. In: Eykhoff P (ed) Trends and progress in system identification. Perg. Press, pp 239–304

  57. Peterka V (1991) Adaptation for LQG control design to engineering needs. In: Warwick K, Kárný M, Halousková A (eds) Lecture notes: adv. methods in adaptive control for industrial application; Joint UK-CS seminar, vol 158. Springer-Verlag, NY

  58. Peterka V, Åström K (1973) Control of multivariable systems with unknown but constant parameters. In: Prepr. of the 3rd IFAC Symp. on identification and process parameter estimation, IFAC, Hague, Delft, pp 534–544

  59. Puterman M (2005) Markov decision processes: discrete stochastic dynamic programming. Wiley, Hoboken

  60. Quinn A, Kárný M, Guy T (2016) Fully probabilistic design of hierarchical Bayesian models. Inf Sci 369:532–547

  61. Rao M (1987) Measure theory and integration. Wiley, Hoboken

  62. Rohrs C, Valavani L, Athans M, Stein G (1982) Robustness of adaptive control algorithms in the presence of unmodeled dynamics. In: IEEE Conference on Decision and Control, Orlando, FL, vol 1, pp 3–11

  63. Sandholm T (1999) Distributed rational decision making. In: Weiss G (ed) Multiagent systems—a modern approach to distributed artificial intelligence. MIT Press, Cambridge, pp 201–258

  64. Savage L (1954) Foundations of statistics. Wiley, Hoboken

  65. Schweighofer N, Doya K (2003) Meta-learning in reinforcement learning. Neural Netw 16(1):5–9

  66. Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656

  67. Shore J, Johnson R (1980) Axiomatic derivation of the principle of maximum entropy & the principle of minimum cross-entropy. IEEE Trans Inf Th 26(1):26–37

  68. Si J, Barto A, Powell W, Wunsch D (eds) (2004) Handbook of learning and approximate dynamic programming. Wiley-IEEE Press, Hoboken

  69. Tanner M (1993) Tools for statistical inference. Springer Verlag, New York

  70. Tao G (2014) Multivariable adaptive control: a survey. Automatica 50(11):2737–2764

  71. Ullrich M (1964) Optimum control of some stochastic systems. In: Prepr. of the VIII-th conf. ETAN, Beograd

  72. Wolpert D, Macready W (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82

  73. Wu H, Guo X, Liu X (2017) Adaptive exploration-exploitation trade off for opportunistic bandits. Preprint at arXiv:1709.04004

  74. Yang Z, Wang C, Zhang Z, Li J (2019) Mini-batch algorithms with online step size. Knowledge-Based Systems 165:228–240


Author information


Corresponding author

Correspondence to Miroslav Kárný.

Ethics declarations

Conflicts of interest

The author has no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript. This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.


The reported research has been supported by MŠMT ČR LTC18075 and EU-COST Action CA16228.

Availability of data and material

Not applicable

Code availability

The source code of the example is available upon request.

About this article

Cite this article

Kárný, M. Towards on-line tuning of adaptive-agent’s multivariate meta-parameter. Int. J. Mach. Learn. & Cyber. 12, 2717–2731 (2021).

  • Bayesian learning
  • Adaptive agent
  • Meta-parameter tuning
  • Fully probabilistic design
  • Kullback–Leibler divergence
  • Dynamic decision making