Exploration from Generalization Mediated by Multiple Controllers

  • Peter Dayan


Intrinsic motivation involves internally governed drives for exploration, curiosity, and play. Over the course of development and beyond, these drives lead subjects to explore, to learn, to expand the repertoire of actions they are capable of performing, and to acquire skills that can be useful in future domains. We adopt a utilitarian view of this learning process, treating it in terms of exploration bonuses. These bonuses arise from distributions over the structure of the world that imply potential benefits from generalizing knowledge and skills to subsequent environments. We discuss how functionally and architecturally different controllers may realize these bonuses in different ways.


Keywords: Intrinsic Motivation; Dirichlet Process Mixture; Markov Decision Problem; Hierarchical Dirichlet Process; Exploration Bonus



I am very grateful to Andrew Barto, the editors, and two anonymous reviewers for their comments on this chapter. My work is funded by the Gatsby Charitable Foundation.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. University College London, Gatsby Computational Neuroscience Unit, London, UK
