Totally model-free actor-critic recurrent neural-network reinforcement learning in non-Markovian domains

Annals of Operations Research (OR in Neuroscience)

Abstract

For solving a sequential decision-making problem in a non-Markovian domain, standard dynamic programming (DP) requires a complete mathematical model; hence, it is a totally model-based approach. By contrast, this paper describes a totally model-free approach by actor-critic reinforcement learning with recurrent neural networks. The recurrent connections (or context units) in neural networks act as an implicit form of internal state (i.e., history memory) for developing sensitivity to hidden non-Markovian dependencies, rendering the process Markovian implicitly and automatically in a totally model-free fashion. That is, the model-free recurrent-network agent learns neither the transitional probabilities and associated rewards, nor how much the state space should be enlarged so that the Markov property holds. For concreteness, we illustrate time-lagged path problems, in which our learning agent is expected to learn a best (history-dependent) policy that maximizes the total return, the sum of one-step transitional rewards plus special “bonus” values dependent on prior transitions or decisions. Since we can obtain an optimal solution by model-based DP, this is an excellent test of the learning agent for understanding its model-free learning behavior. Such actor-critic recurrent-network learning might constitute a mechanism that animal brains use when experientially acquiring skilled action. Given a concrete non-Markovian problem example, the goal of this paper is to show the conceptual merit of totally model-free learning with actor-critic recurrent networks, compared with classical DP (and other model-building procedures), rather than to pursue a best recurrent-network learning strategy.
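
To illustrate how recurrent context units can carry history, the following minimal sketch builds an Elman-style layer whose context units copy the previous hidden activations; it uses randomly initialized weights and a scalar output, and is only a generic illustration, not the authors' architecture or training procedure. Two input sequences that end in the same observation but differ in their history produce different outputs, which is the kind of hidden non-Markovian dependency the actor and critic networks are meant to exploit.

```python
import numpy as np

# Minimal Elman-style recurrent layer (illustrative only, not the paper's networks).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W_in  = rng.normal(scale=0.5, size=(n_hid, n_in))   # input  -> hidden
W_ctx = rng.normal(scale=0.5, size=(n_hid, n_hid))  # context (previous hidden) -> hidden
w_out = rng.normal(scale=0.5, size=n_hid)           # hidden -> scalar output (e.g., a critic value)

def run(sequence):
    """Feed a sequence of input vectors; return the final scalar output."""
    h = np.zeros(n_hid)                              # context units start at zero
    for x in sequence:
        h = np.tanh(W_in @ x + W_ctx @ h)            # hidden state folds in the history
    return float(w_out @ h)

# Two sequences with the SAME final input but DIFFERENT histories generally
# produce different outputs: the context units carry the hidden dependency.
common_last = np.array([1.0, 0.0, 0.0])
seq_a = [np.array([0.0, 1.0, 0.0]), common_last]
seq_b = [np.array([0.0, 0.0, 1.0]), common_last]
print(run(seq_a), run(seq_b))
```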

Notes

  1. Recurrent neural networks may be employed not only for model-free learning but also for model-building procedures; e.g., see action model building in Lin (1993). Such model-building recurrent-network learning procedures are outside the scope of this paper. In addition, model-free learning schemes in more recent work [e.g., see Ni et al. (2015)] may not necessarily be designed to deal with non-Markovian hidden states.

References

  • Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.

  • Bellman, R. E. (1961). Adaptive control processes. A guided tour. Princeton, NJ: Princeton University Press.

  • Bellman, R. E., & Dreyfus, S. E. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.

  • Bellman, R. E., & Dreyfus, S. E. (1962). Applied dynamic programming. Princeton, NJ: Princeton University Press.

  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.

  • Bush, K. A. (2008). An echo state model of non-Markovian reinforcement learning. Ph.D. thesis. Fort Collins, CO: Department of Computer Science, Colorado State University.

  • Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 815–826.

  • Crites, R. H., & Barto, A. G. (1995). An actor/critic algorithm that is equivalent to Q-learning. Advances in neural information processing systems 7 (pp. 401–408). San Mateo, CA: Morgan Kaufmann.

  • Dreyfus, S. E., & Law, A. (1977). The art and theory of dynamic programming (Mathematics in Science and Engineering, Vol. 130). Cambridge: Academic Press.

  • Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

  • Guez, A., Silver, D., & Dayan, P. (2013). Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48, 841–883.

  • Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.

  • Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithms for partially observable Markov decision problems. Advances in neural information processing systems 7 (pp. 345–352). San Mateo, CA: Morgan Kaufmann.

  • Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the eighth annual conference of the cognitive science society (pp. 531–546).

  • Konda, V. R. (2002). Actor-critic algorithms. Ph.D. thesis. Department of EECS, Massachusetts Institute of Technology.

  • Konda, V. R., & Tsitsiklis, J. N. (2003). Actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.

  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

  • Lin, L. J. (1993). Reinforcement learning for robots using neural networks. Ph.D. thesis. Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.

  • Lin, L. J., & Mitchell, T. M. (1992). Memory approaches to reinforcement learning in non-Markovian domains. Technical report CMU-CS-92-138, School of Computer Science, Carnegie Mellon University.

  • Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2), 191–209.

  • McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Ph.D. thesis. Rochester, NY: Department of Computer Science, University of Rochester (revised in 1996).

  • Millan, J. R., & Torras, C. (1992). A reinforcement connectionist approach to robot path finding in non-maze-like environments. Machine Learning, 8(3), 363–393.

  • Mizutani, E. (1997). Learning from reinforcement. In J.-S. R. Jang, C.-T. Sun, & E. Mizutani (Eds.), Neuro-fuzzy and soft computing: A computational approach to learning and machine intelligence. Upper Saddle River, NJ: Prentice Hall. Chapter 10.

  • Mizutani, E. (1999). Sample path-based policy-only learning by actor neural networks. In Proceedings of the IEEE international conference on neural networks (Vol. 2, pp. 1245–1250). Washington, DC, July 10–16.

  • Mizutani, E., & Dreyfus, S. E. (1998). Totally model-free reinforcement learning by actor-critic Elman networks in non-Markovian domains. In Proceedings of the IEEE international joint conference on neural networks, part of the world congress on computational intelligence (WCCI'98) (pp. 2016–2021). Alaska, May 4–9.

  • Mizutani, E., & Dreyfus, S. E. (2003). On using discretized Cohen-Grossberg node dynamics for model-free actor-critic neural learning in non-Markovian domains. In Proceedings of the IEEE international symposium on computational intelligence in robotics and automation (CIRA 2003) (Vol. 1, pp. 1–6). Kobe, July 16–20.

  • Mizutani, E., & Dreyfus, S. E. (2010). An analysis on negative curvature induced by singularity in multi-layer neural-network learning. In Advances in neural information processing systems, (NIPS 2010) (pp. 1669–1677).

  • Mizutani, E., Dreyfus, S. E., & Nishio, K. (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. In Proceedings of the IEEE international conference on neural networks (Vol. 2, pp. 167–172) Como, July. http://queue.ieor.berkeley.edu/People/Faculty/dreyfus-pubs/ijcnn2k.pdf.

  • Ni, Z., He, H., Wen, J., & Xu, X. (2013). Goal representation heuristic dynamic programming on maze navigation. IEEE Transactions on Neural Networks and Learning Systems, 24(12), 2038–2050.

  • Ni, Z., He, H., Zhong, X., & Prokhorov, D. V. (2015). Model-free dual heuristic dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 26(5), 1834–1839.

  • Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53, 139–154.

  • Pendrith, M. D. (1994). On reinforcement learning of control actions in noisy and non-Markovian domains. Technical report UNSW-CSE-TR-9410. Sydney: School of Computer Science and Engineering, The University of New South Wales.

  • Powell, W. B. (2012). Perspectives of approximate dynamic programming. Annals of Operations Research, 141, 1–38.

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press. Chapter 8.

  • Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.

  • Si, J., & Wang, Y. T. (2001). On-line learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264–276.

  • Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

  • Sutton, R. S. (1990). Reinforcement learning architectures for animats. In Proceedings of the first international conference on simulation of adaptive behavior: From animals to animats (pp. 288–296).

  • Sutton, R. S. (1991). Planning by incremental dynamic programming. In L. A. Birnbaum & G. C. Collins (Eds.), Machine learning: Proceedings of the eighth international workshop (pp. 353–357). San Mateo, CA: Morgan Kaufmann.

  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

  • Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3), 279–292.

  • Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.

  • Werbos, P. J. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches (pp. 65–89). New York: Van Nostrand Reinhold. Chapter 3.

  • Whitehead, S. D., & Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7, 45–83.

  • Whitehead, S. D., & Lin, L. J. (1995). Reinforcement learning of non-Markov decision processes. Artificial Intelligence, 73, 271–306.

  • Widrow, B., Gupta, N. K., & Maitra, S. (1973). Punish/reward: Learning with a critic in adaptive threshold systems. IEEE Transactions on Systems, Man, and Cybernetics, 3, 455–465.


Acknowledgements

Funding was provided by the Ministry of Science and Technology, Taiwan (Grant 104-2221-E-011-096).

Author information

Corresponding author: Eiji Mizutani.

Appendices

Appendix 1: Stochastic dynamic programming (DP) solution procedures

We show the standard backward DP solution procedures for stochastic problems subject to the following two stochastic versions of the bonus rule introduced in Sect. 2.2:

  1. Prior-transition dependent bonus rule, and

  2. Prior-action dependent bonus rule.

Here, the process is stochastic in the sense that when the agent tries to move in a certain direction, either diagonally up or down, it does so with probability p but goes in the other direction with probability \(1-p\). In our numerical example in Fig. 8, we chose \(p = 0.9\) and a bonus value of 4.
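
For reference, here is a minimal sketch of this stochastic move rule; the function name and the use of Python's random module are our own illustration, not part of the paper.

```python
import random

def actual_transition(intended, p=0.9, rng=random):
    """Return the transition that actually occurs when the agent chooses to
    move diagonally up ('u') or down ('d'): the intended move is realised
    with probability p, the opposite move with probability 1 - p."""
    other = 'd' if intended == 'u' else 'u'
    return intended if rng.random() < p else other

# Example: a few sampled outcomes for the intended action 'u' with p = 0.9.
print([actual_transition('u') for _ in range(5)])
```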

As described in Sect. 2.3, classical DP is a model-based approach, requiring explicit transition and reward models to allow a DP solution based on the principle of optimality. To solve our longest-path problem with a bonus rule, we must enlarge the state space to make the process Markovian by choosing proper arguments for the optimal value function; this is in general a matter of art rather than science [see a variety of examples in Bellman and Dreyfus (1962) and Dreyfus and Law (1977)]. These arguments must explicitly encode just the amount of history information required by the bonus rule.

Appendix 1.1: DP solution to the prior-transition dependent problem

For the prior-transition dependent case, the state space must contain the current two-dimensional coordinates plus the last two consecutive “actual” transitions, no matter what actions were taken. Our classical backward DP formulation requires the following four steps, assuming a unit discount factor in our finite-horizon path problem.

  • ① Definition of the optimal value function:

    $$\begin{aligned}
    V(x,y,z_1,z_2) \;\stackrel{\mathrm{def}}{=}\; & \text{maximum expected reward-to-go, starting at vertex } (x,y) \text{ to the end,} \\
    & \text{with transition } z_1 \text{ one stage before and } z_2 \text{ two stages before.}
    \end{aligned}$$
    (16)
  • ② Recurrence relation using one-stage reward \(R_z(x,y)\) at the current vertex \((x,y)\) when action z [\(z=u\) (up) or \(z=d\) (down)] is taken:

    $$\begin{aligned}
    V(x,y,u,d) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,d,u) + \text{bonus} \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,u,u) \right]
    \end{array}
    \right. \\
    V(x,y,u,u) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,d,u) \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,u,u) + \text{bonus} \right]
    \end{array}
    \right.
    \end{aligned}$$

    with obvious modifications for \(V(x,y,d,u)\) and \(V(x,y,d,d)\).

    • The best action z (u or d) selected for evaluating each \(V(x,y,z_1,z_2)\) above is stored into the so-called optimal policy function as

      $$\begin{aligned} \pi (x,y, z_1, z_2)=z. \end{aligned}$$
  • ③ Boundary condition at terminal stage \(N (=4)\) in Fig. 8:

    \(V(5,0,-,-)=-1\); \(V(5,2,-,-)=3\); \(V(5,4,-,-)=V(5,6,-,-)=0\); \(V(5,8,-,-)=1\).

  • ④ Answer is given by \(V(A,-,-)=V(1,4,-,-)\) at the initial vertex \(A=(1,4)\).   \(\square \)

Using full backups (see Fig. 3) with \(p=0.9\) and a bonus value of 4, the above DP procedure yields the optimal policy, which may be summarized by the eight possible transitional paths (or trajectories) shown in Table 4, together with the simulation results obtained by model-free learning.
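
The backward recursion above can also be sketched compactly in code. In the sketch below, the one-stage rewards \(R_z(x,y)\) come from Fig. 8, which is not reproduced here, so a zero placeholder is used; the bonus condition is read off the displayed recurrence (a bonus is earned when the actual transition matches the transition made two stages earlier); and the "-" placeholders at the initial vertex simply never trigger a bonus. In a full implementation, the maximizing action at each state would also be stored as the optimal policy \(\pi(x,y,z_1,z_2)\).

```python
from functools import lru_cache

P, BONUS = 0.9, 4.0
TERMINAL = {0: -1.0, 2: 3.0, 4: 0.0, 6: 0.0, 8: 1.0}   # boundary values V(5, y, -, -)

def R(x, y, z):
    """One-stage reward for actually moving up ('u') or down ('d') from vertex (x, y).
    Placeholder: the true values come from Fig. 8, not reproduced here."""
    return 0.0

def step(y, z):
    """y-coordinate after the actual transition z."""
    return y + 1 if z == 'u' else y - 1

@lru_cache(maxsize=None)
def V(x, y, z1, z2):
    """Maximum expected reward-to-go from (x, y), given the last two actual transitions."""
    if x == 5:                                           # terminal vertices
        return TERMINAL[y]
    best = float('-inf')
    for a in ('u', 'd'):                                 # action chosen at this stage
        other = 'd' if a == 'u' else 'u'
        expected = 0.0
        for z, prob in ((a, P), (other, 1.0 - P)):       # z = transition that actually occurs
            bonus = BONUS if z == z2 else 0.0            # bonus: transition matches the one two stages back
            expected += prob * (R(x, y, z) + bonus + V(x + 1, step(y, z), z, z1))
        best = max(best, expected)                       # the argmax would be stored as pi(x, y, z1, z2)
    return best

print(V(1, 4, '-', '-'))    # answer at the initial vertex A = (1, 4)
```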

Appendix 1.2: DP solution to the prior-action dependent problem

Next, for solving the prior-action dependent case, the arguments of the optimal value function must explicitly define the current two-dimensional coordinates plus the last two consecutive actions taken, regardless of the actual transitions that followed. Here are the four steps of classical backward DP, again assuming a unit discount rate (a code sketch of the required modifications follows the four steps):

  • ① Definition of the optimal value function:

    $$\begin{aligned}
    V(x,y,z_1,z_2) \;\stackrel{\mathrm{def}}{=}\; & \text{maximum expected reward-to-go, starting at vertex } (x,y) \text{ to the end,} \\
    & \text{with action } z_1 \text{ one stage before and } z_2 \text{ two stages before.}
    \end{aligned}$$
    (17)
  • ② Recurrence relation using one-stage reward \(R_z(x,y)\) at the current vertex \((x,y)\) when action z [\(z=u\) (up) or \(z=d\) (down)] is taken:

    $$\begin{aligned}
    V(x,y,u,d) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,u,u) \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,d,u) + \text{bonus} \right]
    \end{array}
    \right. \\
    V(x,y,u,u) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,u,u) + \text{bonus} \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,d,u) \right]
    \end{array}
    \right.
    \end{aligned}$$

    with obvious modifications for \(V(x,y,d,u)\) and \(V(x,y,d,d)\).

    • The best action z (u or d) selected for evaluating each \(V(x,y,z_1,z_2)\) above is stored into the so-called optimal policy function as

      $$\begin{aligned} \pi (x,y, z_1, z_2)=z. \end{aligned}$$
  • ③ Boundary condition at terminal stage \(N (=4)\) in Fig. 8:

    \(V(5,0,-,-)=-1\); \(V(5,2,-,-)=3\); \(V(5,4,-,-)=V(5,6,-,-)=0\); \(V(5,8,-,-)=1\).

  • ④ Answer is given by \(V(A,-,-)=V(1,4,-,-)\) at the initial vertex \(A=(1,4)\). \(\square \)
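
Relative to the sketch given for the prior-transition dependent case (whose helpers P, BONUS, TERMINAL, R, and step are reused here), only two things change, again as read off the displayed recurrence: the bonus now tests the chosen action against the action taken two stages before, and the successor state records chosen actions rather than actual transitions.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V_action(x, y, z1, z2):
    """Maximum expected reward-to-go from (x, y), given the last two chosen actions."""
    if x == 5:
        return TERMINAL[y]
    best = float('-inf')
    for a in ('u', 'd'):
        other = 'd' if a == 'u' else 'u'
        bonus = BONUS if a == z2 else 0.0                # bonus: chosen action matches the action two stages back
        expected = 0.0
        for z, prob in ((a, P), (other, 1.0 - P)):       # the actual transition may differ from the action
            expected += prob * (R(x, y, z) + bonus + V_action(x + 1, step(y, z), a, z1))
        best = max(best, expected)
    return best

print(V_action(1, 4, '-', '-'))    # answer at the initial vertex A = (1, 4)
```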

Cite this article

Mizutani, E., & Dreyfus, S. E. (2017). Totally model-free actor-critic recurrent neural-network reinforcement learning in non-Markovian domains. Annals of Operations Research, 258, 107–131. https://doi.org/10.1007/s10479-016-2366-2