
Divide and Conquer: Hierarchical Reinforcement Learning and Task Decomposition in Humans

  • Carlos Diuk
  • Anna Schapiro
  • Natalia Córdova
  • José Ribas-Fernandes
  • Yael Niv
  • Matthew Botvinick

Abstract

The field of computational reinforcement learning (RL) has proved extremely useful in research on human and animal behavior and brain function. However, the simple forms of RL considered in most empirical research do not scale well, making their relevance to complex, real-world behavior unclear. In computational RL, one strategy for addressing the scaling problem is to introduce hierarchical structure, an approach that has intriguing parallels with human behavior. We have begun to investigate the potential relevance of hierarchical RL (HRL) to human and animal behavior and brain function. In the present chapter, we first review two results demonstrating neural correlates of key predictions from HRL. We then focus on one aspect of this work: the question of how action hierarchies are initially established. Work in HRL suggests that hierarchies are learned by identifying useful subgoal states, and that this may in turn be accomplished through a structural analysis of the given task domain. We review results from a set of behavioral and neuroimaging experiments in which we investigated the relevance of these ideas to human learning and decision making.
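The subgoal-discovery idea admits a concrete illustration. One proposal in the computational HRL literature is to treat the task domain as a state-transition graph and flag bottleneck states, those lying on many shortest paths (high betweenness centrality), as candidate subgoals. The Python sketch below applies this measure to a small two-room world; the two-room layout, the state names, and the use of the networkx library are our illustrative assumptions, not details of the chapter's experiments.

```python
# A minimal sketch of structural subgoal discovery (illustrative only):
# rank states of a task's transition graph by betweenness centrality.
# In a two-room world, the doorway region is the bottleneck that every
# cross-room trajectory must pass through, so it should top the ranking.
import networkx as nx

G = nx.Graph()

def add_room(prefix):
    """Add a 3x3 grid of states with 4-neighbour transitions."""
    for r in range(3):
        for c in range(3):
            if c < 2:
                G.add_edge(f"{prefix}{r}{c}", f"{prefix}{r}{c + 1}")
            if r < 2:
                G.add_edge(f"{prefix}{r}{c}", f"{prefix}{r + 1}{c}")

add_room("L")              # left room: states L00 .. L22
add_room("R")              # right room: states R00 .. R22
G.add_edge("L12", "door")  # the only passage between the two rooms
G.add_edge("door", "R10")

# Rank states by betweenness; the doorway and its flanking cells dominate.
centrality = nx.betweenness_centrality(G)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]
for state, score in top:
    print(f"{state}: {score:.3f}")
```

Other graph statistics, for example measures based on adjacency relations or community structure, can be substituted for betweenness in the same pipeline; the point is only that candidate subgoals can fall out of a structural analysis of the domain, without reference to any reward signal.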

Keywords

Prediction error · Reinforcement learning · Slot machine · Adjacency relation · Temporal abstraction


Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Carlos Diuk (1, corresponding author)
  • Anna Schapiro (1)
  • Natalia Córdova (1)
  • José Ribas-Fernandes (1)
  • Yael Niv (1)
  • Matthew Botvinick (1)

  1. Department of Psychology and Princeton Neuroscience Institute, Princeton University, Princeton, USA
