Totally model-free actor-critic recurrent neural-network reinforcement learning in non-Markovian domains

Annals of Operations Research (OR in Neuroscience)

Abstract

For solving a sequential decision-making problem in a non-Markovian domain, standard dynamic programming (DP) requires a complete mathematical model; hence, it is a totally model-based approach. By contrast, this paper describes a totally model-free approach by actor-critic reinforcement learning with recurrent neural networks. The recurrent connections (or context units) in neural networks act as an implicit form of internal state (i.e., history memory) for developing sensitivity to hidden non-Markovian dependencies, rendering the process Markovian implicitly and automatically in a totally model-free fashion. That is, the model-free recurrent-network agent learns neither the transitional probabilities and associated rewards, nor how much the state space should be enlarged so that the Markov property holds. For concreteness, we illustrate time-lagged path problems, in which our learning agent is expected to learn a best (history-dependent) policy that maximizes the total return, the sum of one-step transitional rewards plus special “bonus” values dependent on prior transitions or decisions. Since we can obtain an optimal solution by model-based DP, this is an excellent test of the learning agent for understanding its model-free learning behavior. Such actor-critic recurrent-network learning might constitute a mechanism that animal brains use when experientially acquiring skilled action. Given a concrete non-Markovian problem example, the goal of this paper is to show the conceptual merit of totally model-free learning with actor-critic recurrent networks, compared with classical DP (and other model-building procedures), rather than to pursue a best recurrent-network learning strategy.
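
To illustrate how recurrent context units can carry history, the following minimal sketch builds an Elman-style layer whose context units copy the previous hidden activations; it uses randomly initialized weights and a scalar output, and is only a generic illustration, not the authors' architecture or training procedure. Two input sequences that end in the same observation but differ in their history produce different outputs, which is the kind of hidden non-Markovian dependency the actor and critic networks are meant to exploit.

```python
import numpy as np

# Minimal Elman-style recurrent layer (illustrative only, not the paper's networks).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W_in  = rng.normal(scale=0.5, size=(n_hid, n_in))   # input  -> hidden
W_ctx = rng.normal(scale=0.5, size=(n_hid, n_hid))  # context (previous hidden) -> hidden
w_out = rng.normal(scale=0.5, size=n_hid)           # hidden -> scalar output (e.g., a critic value)

def run(sequence):
    """Feed a sequence of input vectors; return the final scalar output."""
    h = np.zeros(n_hid)                              # context units start at zero
    for x in sequence:
        h = np.tanh(W_in @ x + W_ctx @ h)            # hidden state folds in the history
    return float(w_out @ h)

# Two sequences with the SAME final input but DIFFERENT histories generally
# produce different outputs: the context units carry the hidden dependency.
common_last = np.array([1.0, 0.0, 0.0])
seq_a = [np.array([0.0, 1.0, 0.0]), common_last]
seq_b = [np.array([0.0, 0.0, 1.0]), common_last]
print(run(seq_a), run(seq_b))
```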

Notes

  1. Recurrent neural networks may be employed not only for model-free learning but also for model-building procedures; e.g., see action model building in Lin (1993). Such model-building recurrent-network learning procedures are outside the scope of this paper. In addition, model-free learning schemes in more recent work [e.g., see Ni et al. (2015)] may not necessarily be designed to deal with non-Markovian hidden states.

References

  • Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.

  • Bellman, R. E. (1961). Adaptive control processes. A guided tour. Princeton, NJ: Princeton University Press.

  • Bellman, R. E., & Dreyfus, S. E. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.

  • Bellman, R. E., & Dreyfus, S. E. (1962). Applied dynamic programming. Princeton, NJ: Princeton University Press.

  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.

  • Bush, K. A. (2008). An echo state model of non-Markovian reinforcement learning. Ph.D. thesis. Fort Collins, CO: Department of Computer Science, Colorado State University.

  • Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 815–826.

  • Crites, R. H., & Barto, A. G. (1995). An actor/critic algorithm that is equivalent to Q-learning. Advances in neural information processing systems 7 (pp. 401–408). San Mateo, CA: Morgan Kaufmann.

  • Dreyfus, S. E., & Law, A. (1977). The art and theory of dynamic programming (Mathematics in Science and Engineering, Vol. 130). Cambridge: Academic Press.

  • Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

  • Guez, A., Silver, D., & Dayan, P. (2013). Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48, 841–883.

  • Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.

  • Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithms for partially observable Markov decision problems. Advances in neural information processing systems 7 (pp. 345–352). San Mateo, CA: Morgan Kaufmann.

  • Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the eighth annual conference of the cognitive science society (pp. 531–546).

  • Konda, V. R. (2002). Actor-critic algorithms. Ph.D. thesis. Department of EECS, Massachusetts Institute of Technology.

  • Konda, V. R., & Tsitsiklis, J. N. (2003). Actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.

  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

  • Lin, L. J. (1993). Reinforcement learning for robots using neural networks. Ph.D. thesis. Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.

  • Lin, L. J., & Mitchell, T. M. (1992). Memory approaches to reinforcement learning in non-Markovian domains. Technical report CMU-CS-92-138, School of Computer Science, Carnegie Mellon University.

  • Marbach, P., & Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2), 191–209.

  • McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Ph.D. thesis. Rochester, NY: Department of Computer Science, University of Rochester (revised in 1996).

  • Millan, J. R., & Torras, C. (1992). A reinforcement connectionist approach to robot path finding in non-maze-like environments. Machine Learning, 8(3), 363–393.

  • Mizutani, E. (1997). Learning from reinforcement. In J.-S. R. Jang, C.-T. Sun, & E. Mizutani (Eds.), Neuro-fuzzy and soft computing: A computational approach to learning and machine intelligence. Upper Saddle River, NJ: Prentice Hall. Chapter 10.

  • Mizutani, E. (1999). Sample path-based policy-only learning by actor neural networks. In Proceedings of the IEEE international conference on neural networks (Vol. 2, pp. 1245–1250). Washington, DC, July 10–16.

  • Mizutani, E., & Dreyfus, S. E. (1998). Totally model-free reinforcement learning by actor-critic Elman networks in non-Markovian domains. In Proceedings of the IEEE international joint conference on neural networks, part of the world congress on computational intelligence (WCCI'98) (pp. 2016–2021). Alaska, May 4–9.

  • Mizutani, E., & Dreyfus, S. E. (2003). On using discretized Cohen-Grossberg node dynamics for model-free actor-critic neural learning in non-Markovian domains. In Proceedings of the IEEE international symposium on computational intelligence in robotics and automation (CIRA 2003) (Vol. 1, pp. 1–6). Kobe, July 16–20.

  • Mizutani, E., & Dreyfus, S. E. (2010). An analysis on negative curvature induced by singularity in multi-layer neural-network learning. In Advances in neural information processing systems, (NIPS 2010) (pp. 1669–1677).

  • Mizutani, E., Dreyfus, S. E., & Nishio, K. (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. In Proceedings of the IEEE international conference on neural networks (Vol. 2, pp. 167–172) Como, July. http://queue.ieor.berkeley.edu/People/Faculty/dreyfus-pubs/ijcnn2k.pdf.

  • Ni, Z., He, H., Wen, J., & Xu, X. (2013). Goal representation heuristic dynamic programming on maze navigation. IEEE Transactions on Neural Networks and Learning Systems, 24(12), 2038–2050.

  • Ni, Z., He, H., Zhong, X., & Prokhorov, D. V. (2015). Model-free dual heuristic dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 26(5), 1834–1839.

  • Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53, 139–154.

  • Pendrith, M. D. (1994). On reinforcement learning of control actions in noisy and non-Markovian domains. Technical report UNSW-CSE-TR-9410. Sydney: School of Computer Science and Engineering, The University of New South Wales.

  • Powell, W. B. (2012). Perspectives of approximate dynamic programming. Annals of Operations Research, 141, 1–38.

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press. Chapter 8.

  • Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.

  • Si, J., & Wang, Y. T. (2001). On-line learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264–276.

  • Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

  • Sutton, R. S. (1990). Reinforcement learning architectures for animats. In Proceedings of the first international conference on simulation of adaptive behavior: From animals to animats (pp. 288–296).

  • Sutton, R. S. (1991). Planning by incremental dynamic programming. In L. A. Birnbaum & G. C. Collins (Eds.), Machine learning: Proceedings of the eighth international workshop (pp. 353–357). San Mateo, CA: Morgan Kaufmann.

  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

  • Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3), 279–292.

  • Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.

  • Werbos, P. J. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches (pp. 65–89). New York: Van Nostrand Reinhold. Chapter 3.

  • Whitehead, S. D., & Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7, 45–83.

  • Whitehead, S. D., & Lin, L. J. (1995). Reinforcement learning of non-Markov decision processes. Artificial Intelligence, 73, 271–306.

  • Widrow, B., Gupta, N. K., & Maitra, S. (1973). Punish/reward: Learning with a critic in adaptive threshold systems. IEEE Transactions on Systems, Man, and Cybernetics, 3, 455–465.


Acknowledgements

Funding was provided by the Ministry of Science and Technology, Taiwan (Grant 104-2221-E-011-096).

Author information

Corresponding author: Eiji Mizutani.

Appendices

Appendix 1: Stochastic dynamic programming (DP) solution procedures

We show the standard backward DP solution procedures for stochastic problems subject to the following two stochastic versions of the bonus rule introduced in Sect. 2.2:

  1. Prior-transition dependent bonus rule, and

  2. Prior-action dependent bonus rule.

Here, the process is stochastic in the sense that when the agent tries to move in a certain direction, either diagonally up or down, it does so with probability p but goes in the other direction with probability \(1-p\). In our numerical example in Fig. 8, we chose \(p = 0.9\) and a bonus value of 4.
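
For reference, here is a minimal sketch of this stochastic move rule; the function name and the use of Python's random module are our own illustration, not part of the paper.

```python
import random

def actual_transition(intended, p=0.9, rng=random):
    """Return the transition that actually occurs when the agent chooses to
    move diagonally up ('u') or down ('d'): the intended move is realised
    with probability p, the opposite move with probability 1 - p."""
    other = 'd' if intended == 'u' else 'u'
    return intended if rng.random() < p else other

# Example: a few sampled outcomes for the intended action 'u' with p = 0.9.
print([actual_transition('u') for _ in range(5)])
```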

As described in Sect. 2.3, classical DP is a model-based approach, requiring explicit transition and reward models to allow a DP solution based on the principle of optimality. To solve our longest-path problem with a bonus rule, we must enlarge the state space to make the process Markovian by choosing proper arguments for the optimal value function; this is in general a matter of art rather than science [see a variety of examples in Bellman and Dreyfus (1962) and Dreyfus and Law (1977)]. These arguments must explicitly encode just the amount of history information required by the bonus rule.

Appendix 1.1: DP solution to the prior-transition dependent problem

For the prior-transition dependent case, the state space must contain the current two-dimensional coordinates plus the last two consecutive “actual” transitions, no matter what actions were taken. Our classical backward DP formulation requires the following four steps, assuming a unit discount factor in our finite-horizon path problem.

  • ① Definition of the optimal value function:

    $$\begin{aligned}
    V(x,y,z_1,z_2) \;\stackrel{\mathrm{def}}{=}\; & \text{maximum expected reward-to-go, starting at vertex } (x,y) \text{ to the end,} \\
    & \text{with transition } z_1 \text{ one stage before and } z_2 \text{ two stages before.}
    \end{aligned}$$
    (16)
  • ② Recurrence relation using one-stage reward \(R_z(x,y)\) at the current vertex \((x,y)\) when action z [\(z=u\) (up) or \(z=d\) (down)] is taken:

    $$\begin{aligned}
    V(x,y,u,d) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,d,u) + \text{bonus} \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,u,u) \right]
    \end{array}
    \right. \\
    V(x,y,u,u) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,d,u) \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,u,u) + \text{bonus} \right]
    \end{array}
    \right.
    \end{aligned}$$

    with obvious modifications for \(V(x,y,d,u)\) and \(V(x,y,d,d)\).

    • The best action z (u or d) selected for evaluating each \(V(x,y,z_1,z_2)\) above is stored into the so-called optimal policy function as

      $$\begin{aligned} \pi (x,y, z_1, z_2)=z. \end{aligned}$$
  • ③ Boundary condition at terminal stage \(N (=4)\) in Fig. 8:

    \(V(5,0,-,-)=-1\); \(V(5,2,-,-)=3\); \(V(5,4,-,-)=V(5,6,-,-)=0\); \(V(5,8,-,-)=1\).

  • ④ Answer is given by \(V(A,-,-)=V(1,4,-,-)\) at the initial vertex \(A=(1,4)\).   \(\square \)

Using full backups (see Fig. 3) with \(p=0.9\) and a bonus value of 4, the above DP procedure yields the optimal policy, which may be summarized by the eight possible transitional paths (or trajectories) shown in Table 4, together with the simulation results obtained by model-free learning.
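
The backward recursion above can also be sketched compactly in code. In the sketch below, the one-stage rewards \(R_z(x,y)\) come from Fig. 8, which is not reproduced here, so a zero placeholder is used; the bonus condition is read off the displayed recurrence (a bonus is earned when the actual transition matches the transition made two stages earlier); and the "-" placeholders at the initial vertex simply never trigger a bonus. In a full implementation, the maximizing action at each state would also be stored as the optimal policy \(\pi(x,y,z_1,z_2)\).

```python
from functools import lru_cache

P, BONUS = 0.9, 4.0
TERMINAL = {0: -1.0, 2: 3.0, 4: 0.0, 6: 0.0, 8: 1.0}   # boundary values V(5, y, -, -)

def R(x, y, z):
    """One-stage reward for actually moving up ('u') or down ('d') from vertex (x, y).
    Placeholder: the true values come from Fig. 8, not reproduced here."""
    return 0.0

def step(y, z):
    """y-coordinate after the actual transition z."""
    return y + 1 if z == 'u' else y - 1

@lru_cache(maxsize=None)
def V(x, y, z1, z2):
    """Maximum expected reward-to-go from (x, y), given the last two actual transitions."""
    if x == 5:                                           # terminal vertices
        return TERMINAL[y]
    best = float('-inf')
    for a in ('u', 'd'):                                 # action chosen at this stage
        other = 'd' if a == 'u' else 'u'
        expected = 0.0
        for z, prob in ((a, P), (other, 1.0 - P)):       # z = transition that actually occurs
            bonus = BONUS if z == z2 else 0.0            # bonus: transition matches the one two stages back
            expected += prob * (R(x, y, z) + bonus + V(x + 1, step(y, z), z, z1))
        best = max(best, expected)                       # the argmax would be stored as pi(x, y, z1, z2)
    return best

print(V(1, 4, '-', '-'))    # answer at the initial vertex A = (1, 4)
```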

Appendix 1.2: DP solution to the prior-action dependent problem

Next, for solving the prior-action dependent case, the arguments of the optimal value function must explicitly define the current two-dimensional coordinates plus the last two consecutive actions taken, regardless of the actual transitions that followed. Here are the four steps of classical backward DP, again assuming a unit discount rate (a code sketch of the required modifications follows the four steps):

  • ① Definition of the optimal value function:

    $$\begin{aligned}
    V(x,y,z_1,z_2) \;\stackrel{\mathrm{def}}{=}\; & \text{maximum expected reward-to-go, starting at vertex } (x,y) \text{ to the end,} \\
    & \text{with action } z_1 \text{ one stage before and } z_2 \text{ two stages before.}
    \end{aligned}$$
    (17)
  • ② Recurrence relation using one-stage reward \(R_z(x,y)\) at the current vertex \((x,y)\) when action z [\(z=u\) (up) or \(z=d\) (down)] is taken:

    $$\begin{aligned}
    V(x,y,u,d) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,u,u) \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,d,u) + \text{bonus} \right]
    \end{array}
    \right. \\
    V(x,y,u,u) &= \max \left\{
    \begin{array}{ll}
    u: & p\left[ R_u(x,y) + V(x+1,y+1,u,u) + \text{bonus} \right] \\
       & +\,(1-p)\left[ R_d(x,y) + V(x+1,y-1,u,u) + \text{bonus} \right] \\
    d: & p\left[ R_d(x,y) + V(x+1,y-1,d,u) \right] \\
       & +\,(1-p)\left[ R_u(x,y) + V(x+1,y+1,d,u) \right]
    \end{array}
    \right.
    \end{aligned}$$

    with obvious modifications for \(V(x,y,d,u)\) and \(V(x,y,d,d)\).

    • The best action z (u or d) selected for evaluating each \(V(x,y,z_1,z_2)\) above is stored into the so-called optimal policy function as

      $$\begin{aligned} \pi (x,y, z_1, z_2)=z. \end{aligned}$$
  • ③ Boundary condition at terminal stage \(N (=4)\) in Fig. 8:

    \(V(5,0,-,-)=-1\); \(V(5,2,-,-)=3\); \(V(5,4,-,-)=V(5,6,-,-)=0\); \(V(5,8,-,-)=1\).

  • ④ Answer is given by \(V(A,-,-)=V(1,4,-,-)\) at the initial vertex \(A=(1,4)\). \(\square \)
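
Relative to the sketch given for the prior-transition dependent case (whose helpers P, BONUS, TERMINAL, R, and step are reused here), only two things change, again as read off the displayed recurrence: the bonus now tests the chosen action against the action taken two stages before, and the successor state records chosen actions rather than actual transitions.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V_action(x, y, z1, z2):
    """Maximum expected reward-to-go from (x, y), given the last two chosen actions."""
    if x == 5:
        return TERMINAL[y]
    best = float('-inf')
    for a in ('u', 'd'):
        other = 'd' if a == 'u' else 'u'
        bonus = BONUS if a == z2 else 0.0                # bonus: chosen action matches the action two stages back
        expected = 0.0
        for z, prob in ((a, P), (other, 1.0 - P)):       # the actual transition may differ from the action
            expected += prob * (R(x, y, z) + bonus + V_action(x + 1, step(y, z), a, z1))
        best = max(best, expected)
    return best

print(V_action(1, 4, '-', '-'))    # answer at the initial vertex A = (1, 4)
```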

Cite this article

Mizutani, E., & Dreyfus, S. E. (2017). Totally model-free actor-critic recurrent neural-network reinforcement learning in non-Markovian domains. Annals of Operations Research, 258, 107–131. https://doi.org/10.1007/s10479-016-2366-2