Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures
 Marcus Hutter
Abstract
The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle \( t \), action \( y_t \) results in perception \( x_t \) and reward \( r_t \), where all quantities may in general depend on the complete history. The perception \( x_t \) and reward \( r_t \) are sampled from the (reactive) environmental probability distribution \( \mu \). This very general setting includes, but is not limited to, (partially observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if \( \mu \) is known. Reinforcement learning is usually used if \( \mu \) is unknown. In the Bayesian approach one defines a mixture distribution \( \xi \) as a weighted sum of distributions \( \nu \in \mathcal{M} \), where \( \mathcal{M} \) is any class of distributions including the true environment \( \mu \). We show that the Bayes-optimal policy \( p^{\xi} \) based on the mixture \( \xi \) is self-optimizing in the sense that the average value converges asymptotically for all \( \mu \in \mathcal{M} \) to the optimal value achieved by the (infeasible) Bayes-optimal policy \( p^{\mu} \), which knows \( \mu \) in advance. We show that the necessary condition that \( \mathcal{M} \) admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on \( \mathcal{M} \). As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that \( p^{\xi} \) is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments \( \nu \in \mathcal{M} \) and a strictly higher value in at least one.
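The mixture construction in the abstract can be illustrated on a toy case. The following is a minimal sketch, not the paper's construction: it assumes a hypothetical class \( \mathcal{M} \) of two 2-armed Bernoulli bandits (`ENVS`), a uniform prior (`PRIOR`), and a short finite horizon, and it computes the value of the policy that is optimal with respect to the mixture \( \xi \) by expectimax. All names and numbers are illustrative assumptions.

```python
from functools import lru_cache

# Hypothetical environment class M (an assumption for illustration):
# each environment nu is a 2-armed Bernoulli bandit, given as the
# probability of reward 1 for action 0 and for action 1.
ENVS = ((0.9, 0.1), (0.1, 0.9))
PRIOR = (0.5, 0.5)  # prior weights w_nu, summing to 1


def posterior(weights, action, reward):
    """Bayes update of the mixture weights after seeing `reward` for `action`."""
    likes = [p[action] if reward == 1 else 1.0 - p[action] for p in ENVS]
    unnorm = [w * l for w, l in zip(weights, likes)]
    total = sum(unnorm)
    return tuple(u / total for u in unnorm)


def xi_reward_prob(weights, action):
    """Mixture probability of reward 1: sum over nu of w_nu * nu(r=1 | action)."""
    return sum(w * p[action] for w, p in zip(weights, ENVS))


def q_value(weights, action, steps_left):
    """Expected total reward under xi of taking `action`, then acting xi-optimally."""
    p1 = xi_reward_prob(weights, action)
    total = 0.0
    for reward, prob in ((1, p1), (0, 1.0 - p1)):
        if prob > 0.0:  # only condition on outcomes the mixture considers possible
            nxt = posterior(weights, action, reward)
            total += prob * (reward + value(nxt, steps_left - 1))
    return total


@lru_cache(maxsize=None)
def value(weights, steps_left):
    """Value of the xi-optimal policy over the remaining `steps_left` cycles."""
    if steps_left == 0:
        return 0.0
    return max(q_value(weights, a, steps_left) for a in range(2))


def best_action(weights, steps_left):
    """Action chosen by the xi-optimal policy at the current posterior."""
    return max(range(2), key=lambda a: q_value(weights, a, steps_left))
```

In this toy class, a single observed reward already shifts most of the posterior weight onto one environment, so the \( \xi \)-optimal policy quickly acts like the policy that knows the true environment. The sketch only covers memoryless bandits with a fixed horizon; the general setting of the paper allows history-dependent environments.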
 Title
 Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures
 Book Title
 Computational Learning Theory
 Book Subtitle
 15th Annual Conference on Computational Learning Theory, COLT 2002 Sydney, Australia, July 8–10, 2002 Proceedings
 Pages
 pp 364–379
 Copyright
 2002
 DOI
 10.1007/3-540-45435-7_25
 Print ISBN
 978-3-540-43836-6
 Online ISBN
 978-3-540-45435-9
 Series Title
 Lecture Notes in Computer Science
 Series Volume
 2375
 Series ISSN
 0302-9743
 Publisher
 Springer Berlin Heidelberg
 Copyright Holder
 Springer-Verlag Berlin Heidelberg
 Editors

 Jyrki Kivinen ^{(1)}
 Robert H. Sloan ^{(2)}
 Editor Affiliations

 1. Research School of Information Sciences and Engineering, Australian National University
 2. Computer Science Department, University of Illinois at Chicago
 Authors

 Marcus Hutter ^{(5)}
 Author Affiliations

 5. IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland