Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures


The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle $ t $, action $ y_t $ results in perception $ x_t $ and reward $ r_t $, where all quantities may in general depend on the complete history. The perception $ x_t $ and reward $ r_t $ are sampled from the (reactive) environmental probability distribution $ \mu $. This very general setting includes, but is not limited to, (partially observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if $ \mu $ is known. Reinforcement learning is usually used if $ \mu $ is unknown. In the Bayesian approach one defines a mixture distribution $ \xi $ as a weighted sum of distributions $ \nu \in \mathcal{M} $, where $ \mathcal{M} $ is any class of distributions including the true environment $ \mu $. We show that the Bayes-optimal policy $ p^\xi $ based on the mixture $ \xi $ is self-optimizing in the sense that the average value converges asymptotically for all $ \mu \in \mathcal{M} $ to the optimal value achieved by the (infeasible) Bayes-optimal policy $ p^\mu $, which knows $ \mu $ in advance. We show that the necessary condition that $ \mathcal{M} $ admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on $ \mathcal{M} $. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that $ p^\xi $ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments $ \nu \in \mathcal{M} $ and a strictly higher value in at least one.
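
The three notions used in the abstract (mixture, self-optimizing, Pareto-optimal) can be written out as a rough sketch. The notation here is an assumption and does not come from the abstract: $ w_\nu $ denotes the prior weight of environment $ \nu $, $ m $ the horizon, and $ V_{1m}^{p\nu} $ the expected total reward of policy $ p $ in environment $ \nu $ over cycles 1 to $ m $. The sum presumes a countable class $ \mathcal{M} $ with strictly positive weights.

```latex
% Assumed notation (not in the abstract): w_\nu = prior weights, m = horizon,
% V_{1m}^{p\nu} = expected total reward of policy p in environment \nu
% over cycles 1..m.

% Bayes mixture: a weighted sum over the environments in the class.
\[
  \xi(x_1 r_1 \dots x_m r_m \mid y_1 \dots y_m)
  = \sum_{\nu \in \mathcal{M}} w_\nu \,
    \nu(x_1 r_1 \dots x_m r_m \mid y_1 \dots y_m),
  \qquad w_\nu > 0, \quad \sum_{\nu \in \mathcal{M}} w_\nu = 1 .
\]

% Self-optimizing: the average value of p^\xi approaches the average
% value of the informed policy p^\mu, whichever \mu is the true one.
\[
  \frac{1}{m} V_{1m}^{p^\xi \mu} - \frac{1}{m} V_{1m}^{p^\mu \mu}
  \;\longrightarrow\; 0
  \quad \text{as } m \to \infty,
  \qquad \text{for all } \mu \in \mathcal{M} .
\]

% Pareto-optimal: there is no policy p such that
\[
  V_{1m}^{p \nu} \ge V_{1m}^{p^\xi \nu}
  \ \text{ for all } \nu \in \mathcal{M}
  \quad \text{and} \quad
  V_{1m}^{p \nu} > V_{1m}^{p^\xi \nu}
  \ \text{ for at least one } \nu \in \mathcal{M} .
\]
```

Read this way, self-optimization is an asymptotic statement about the true environment $ \mu $, whereas Pareto-optimality compares policies across every environment in $ \mathcal{M} $ at once.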