Theory in Biosciences, Volume 131, Issue 3, pp 139–148

An information-theoretic approach to curiosity-driven reinforcement learning

  • Susanne Still
  • Doina Precup
Original Paper


Abstract

We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emergent behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
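The Boltzmann-style policies discussed in the abstract can be illustrated with a minimal softmax over action values, in which a single inverse-temperature parameter trades off expected return against the coding cost (entropy) of the policy. This is a generic sketch, not the paper's derivation; the action values used below are hypothetical.

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """Softmax (Boltzmann) distribution over actions.

    beta is the inverse temperature: beta -> 0 yields the uniform,
    maximum-entropy policy (cheapest to encode, pure exploration);
    beta -> infinity yields the greedy policy (pure exploitation).
    """
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical action values, for illustration only
q = [1.0, 2.0, 0.5]
print(boltzmann_policy(q, beta=0.0))   # uniform: exploration dominates
print(boltzmann_policy(q, beta=10.0))  # near-greedy: exploitation dominates
```

Sweeping beta between these extremes traces out the return-versus-coding-cost trade-off that the paper analyzes; the curiosity bonus it proposes would enter as an additional term in the exponent.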


Keywords: Reinforcement learning · Exploration–exploitation trade-off · Information theory · Rate distortion theory · Curiosity · Adaptive behavior



This research was funded in part by NSERC and ONR.



Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. Information and Computer Sciences, University of Hawaii at Mānoa, Honolulu, USA
  2. School of Computer Science, McGill University, Montreal, Canada
