
Feature Reinforcement Learning in Practice

  • Phuong Nguyen
  • Peter Sunehag
  • Marcus Hutter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7188)

Abstract

Following a recent surge in the use of history-based methods for resolving perceptual aliasing in reinforcement learning, we introduce an algorithm based on the feature reinforcement learning framework ΦMDP [13]. To create a practical algorithm we devise a stochastic search procedure over a class of context trees, based on parallel tempering and a specialized proposal distribution. We provide the first empirical evaluation of ΦMDP. Our proposed algorithm outperforms the classical U-tree algorithm [20] and the recent active-LZ algorithm [6], and is competitive with MC-AIXI-CTW [29], which maintains a Bayesian mixture over all context trees up to a chosen depth. We are encouraged by our ability to compete with this sophisticated method using an algorithm that simply picks one single model and runs Q-learning on the corresponding MDP. Our ΦMDP algorithm is simpler and consumes less time and memory. These results show promise for our future work on tackling larger and more complex problems.
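
To make the pipeline described above concrete, the following is a minimal, simplified sketch of the idea it outlines: score candidate history-to-state maps by a code-length criterion, search over them with a parallel-tempering scheme, and then run Q-learning on the MDP induced by the selected map. This is not the authors' implementation: the paper's class of context trees and its specialized proposal distribution are replaced here by a single suffix-depth parameter and a naive ±1 proposal, and all identifiers (induced_mdp_cost, parallel_tempering_search, the penalty term, the toy data) are invented for illustration.

```python
# Simplified sketch only -- NOT the authors' code. The feature map Phi is reduced
# to "take the last k symbols of the history"; the real algorithm searches over
# context trees with a specialized proposal distribution.
import random
from collections import defaultdict
from math import log2


def induced_mdp_cost(history, k):
    """MDL-style score (in bits) of the model induced by mapping each time step
    to the tuple of its last k symbols: empirical code length of the next-symbol
    distribution per state, plus a crude per-state parameter penalty."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(k, len(history)):
        state = tuple(history[t - k:t])
        counts[state][history[t]] += 1
    cost = 0.0
    for nxt in counts.values():
        n = sum(nxt.values())
        for c in nxt.values():
            cost -= c * log2(c / n)          # empirical log-loss for this state
        cost += 0.5 * len(nxt) * log2(n)     # penalty for this state's parameters
    return cost


def parallel_tempering_search(history, depths, temps, iters=200):
    """Toy parallel tempering: one Metropolis chain per temperature proposes a
    local move of the suffix depth; neighbouring chains periodically swap."""
    chains = [random.choice(depths) for _ in temps]
    costs = [induced_mdp_cost(history, k) for k in chains]
    best_k, best_c = chains[0], costs[0]
    for _ in range(iters):
        for i, temp in enumerate(temps):
            proposal = min(max(chains[i] + random.choice([-1, 1]), min(depths)), max(depths))
            c = induced_mdp_cost(history, proposal)
            if c < costs[i] or random.random() < 2.0 ** ((costs[i] - c) / temp):
                chains[i], costs[i] = proposal, c
            if costs[i] < best_c:
                best_k, best_c = chains[i], costs[i]
        # attempt a replica swap between two neighbouring temperatures
        j = random.randrange(len(temps) - 1)
        delta = (costs[j] - costs[j + 1]) * (1.0 / temps[j] - 1.0 / temps[j + 1])
        if delta >= 0 or random.random() < 2.0 ** delta:
            chains[j], chains[j + 1] = chains[j + 1], chains[j]
            costs[j], costs[j + 1] = costs[j + 1], costs[j]
    return best_k


if __name__ == "__main__":
    random.seed(0)
    # toy binary stream with short-range structure, standing in for an
    # observation-reward-action history
    hist = [0]
    for _ in range(2000):
        hist.append(hist[-1] if random.random() < 0.9 else 1 - hist[-1])
    k = parallel_tempering_search(hist, depths=range(1, 6), temps=[0.5, 1.0, 2.0])
    print("selected suffix depth:", k)
```

The sketch stops at model selection; in the actual algorithm, once a context tree has been chosen, tabular Q-learning is run on the MDP whose states are the tree's contexts.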

Keywords

Markov Decision Process · Code Length · Minimum Description Length · Proposal Distribution · Stochastic Search

References

  1. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723 (1974)
  2. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
  3. Brafman, R.I., Tennenholtz, M.: R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, 213–231 (2002)
  4. Chrisman, L.: Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In: AAAI, pp. 183–188 (1992)
  5. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons (1991)
  6. Farias, V., Moallemi, C., Van Roy, B., Weissman, T.: Universal reinforcement learning. IEEE Transactions on Information Theory 56(5), 2441–2454 (2010)
  7. Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: the 23rd Symposium on the Interface, pp. 156–163. Interface Foundation, Fairfax (1991)
  8. Givan, R., Dean, T., Greig, M.: Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147, 163–223 (2003)
  9. Granville, V., Křivánek, M., Rasson, J.P.: Simulated annealing: A proof of convergence. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6), 652–656 (1994)
  10. Grünwald, P.D.: The Minimum Description Length Principle. The MIT Press (2007)
  11. Hukushima, K., Nemoto, K.: Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan 65(4), 1604–1608 (1996)
  12. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin (2005)
  13. Hutter, M.: Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence (2009)
  14. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1998)
  15. Kocsis, L., Szepesvári, C.: Bandit Based Monte-Carlo Planning. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer, Heidelberg (2006)
  16. Li, L., Walsh, T.J., Littman, M.L.: Towards a unified theory of state abstraction for MDPs. In: Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (2006)
  17. Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, Heidelberg (2001)
  18. Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence 147, 5–34 (2003)
  19. Mahmud, M.M.H.: Constructing states for reinforcement learning. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, pp. 727–734 (June 2010), http://www.icml2010.org/papers/593.pdf
  20. McCallum, A.K.: Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis, Department of Computer Science, University of Rochester (1996)
  21. Nguyen, P., Sunehag, P., Hutter, M.: Feature reinforcement learning in practice. Tech. rep., Australian National University (2011)
  22. Poland, J., Hutter, M.: Universal learning of repeated matrix games. In: Proc. 15th Annual Machine Learning Conf. of Belgium and The Netherlands (Benelearn 2006), pp. 7–14. Ghent (2006), http://arxiv.org/abs/cs.LG/0508073
  23. Rissanen, J.: A universal data compression system. IEEE Transactions on Information Theory 29(5), 656–663 (1983)
  24. Schneider, J., Kirkpatrick, S.: Stochastic Optimization, 1st edn. Springer, Heidelberg (2006)
  25. Singh, S.P., James, M.R., Rudary, M.R.: Predictive state representations: A new theory for modeling dynamical systems. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, pp. 512–518 (2004)
  26. Suman, B., Kumar, P.: A survey of simulated annealing as a tool for single and multiobjective optimization. Journal of the Operational Research Society 57, 1143–1160 (2006)
  27. Sunehag, P., Hutter, M.: Consistency of Feature Markov Processes. In: Hutter, M., Stephan, F., Vovk, V., Zeugmann, T. (eds.) ALT 2010. LNCS (LNAI), vol. 6331, pp. 360–374. Springer, Heidelberg (2010)
  28. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. The MIT Press (1998)
  29. Veness, J., Ng, K.S., Hutter, M., Uther, W., Silver, D.: A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research 40(1), 95–142 (2011)
  30. Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., Carrasco, R.C.: Probabilistic finite-state machines. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7), 1013–1025 (2005)
  31. Wallace, C.S.: Statistical and Inductive Inference by Minimum Message Length. Springer, Berlin (2005)
  32. Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J.: The context tree weighting method: Basic properties. IEEE Transactions on Information Theory 41, 653–664 (1995)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Phuong Nguyen (1, 2)
  • Peter Sunehag (1)
  • Marcus Hutter (1, 2, 3)

  1. Australian National University, Australia
  2. NICTA, Australia
  3. ETHZ, Switzerland
