Date: 25 Jun 2002

Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures


Abstract

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle \( t \), action \( y_t \) results in perception \( x_t \) and reward \( r_t \), where all quantities may in general depend on the complete history. The perception \( x_t \) and reward \( r_t \) are sampled from the (reactive) environmental probability distribution \( \mu \). This very general setting includes, but is not limited to, (partially observable, \( k \)-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if \( \mu \) is known. Reinforcement learning is usually used if \( \mu \) is unknown. In the Bayesian approach one defines a mixture distribution \( \xi \) as a weighted sum of distributions \( \nu \in \mathcal{M} \), where \( \mathcal{M} \) is any class of distributions including the true environment \( \mu \). We show that the Bayes-optimal policy \( p^\xi \) based on the mixture \( \xi \) is self-optimizing in the sense that its average value converges asymptotically for all \( \mu \in \mathcal{M} \) to the optimal value achieved by the (infeasible) Bayes-optimal policy \( p^\mu \), which knows \( \mu \) in advance. We show that the necessary condition that \( \mathcal{M} \) admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on \( \mathcal{M} \). As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that \( p^\xi \) is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments \( \nu \in \mathcal{M} \) and a strictly higher value in at least one.
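
For concreteness, the mixture construction and the self-optimizing property stated above can be sketched as follows; the exact weight conditions and the value notation \( V^{p}_{1m} \) (expected reward sum over the first \( m \) cycles) are assumptions made here for illustration, not a verbatim restatement of the paper's definitions.

\[
  \xi(x_{1:t} \mid y_{1:t}) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(x_{1:t} \mid y_{1:t}),
  \qquad w_\nu > 0, \quad \sum_{\nu \in \mathcal{M}} w_\nu \le 1,
\]
\[
  \tfrac{1}{m} V^{p^\xi}_{1m} \;\longrightarrow\; \tfrac{1}{m} V^{p^\mu}_{1m}
  \quad (m \to \infty) \qquad \text{for all } \mu \in \mathcal{M}.
\]

Read this way, the result says that acting optimally with respect to the mixture \( \xi \) is asymptotically as good, in average value, as acting optimally with respect to the true but unknown environment \( \mu \), provided \( \mathcal{M} \) admits self-optimizing policies at all.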