On discontinuous Q-Functions in reinforcement learning

  • Alexander Linden
Technical Papers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 671)

Abstract

This paper considers the application of reinforcement learning to path-finding tasks in continuous state space in the presence of obstacles. We show that cumulative evaluation functions (such as Q-Functions [28] and V-Functions [4]) may be discontinuous if forbidden regions (as implied by obstacles) exist in state space. As the infinite number of states requires the use of function approximators such as backpropagation nets [16, 12, 24], we argue that these discontinuities imply severe difficulties in learning cumulative evaluation functions. The discontinuities we detected might also explain why recent applications of reinforcement learning systems to complex tasks [12] failed to show the desired performance. In our conclusion, we outline some ideas to circumvent the problem.
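The core claim can be made concrete with a toy example (not taken from the paper): in a deterministic shortest-path task, the optimal cost-to-go V(s) equals the length of the shortest collision-free path to the goal, and a thin wall makes V discontinuous, since two states an epsilon apart on opposite sides of the wall have very different values. The sketch below assumes a hypothetical geometry (goal position, wall segment, and the `cost_to_go` helper are all invented for illustration):

```python
import math

GOAL = (1.0, 1.0)
# Hypothetical obstacle: a wall segment on the line x = 0, 0 <= y <= 2.
WALL_X, WALL_Y_LO, WALL_Y_HI = 0.0, 0.0, 2.0

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def cost_to_go(state):
    """Shortest obstacle-free path length from `state` to GOAL.

    Simplification: every state with x < 0 is treated as blocked,
    which is valid for states near the wall's interior. Blocked
    states must detour around one of the wall's endpoints; states
    on the goal side reach the goal in a straight line."""
    x, y = state
    if x >= WALL_X:                      # same side as the goal
        return dist(state, GOAL)
    lo = (WALL_X, WALL_Y_LO)             # go around the lower end...
    hi = (WALL_X, WALL_Y_HI)             # ...or the upper end
    return min(dist(state, lo) + dist(lo, GOAL),
               dist(state, hi) + dist(hi, GOAL))

eps = 1e-3
left  = (-eps, 1.0)   # just behind the wall
right = (+eps, 1.0)   # just in front of it
gap = cost_to_go(left) - cost_to_go(right)
print(f"V jumps by {gap:.3f} between states only {2 * eps} apart")
```

A function approximator such as a backpropagation net, being a continuous mapping, must smooth over this jump, which is exactly the learning difficulty the paper analyzes.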

Keywords

Machine learning, reinforcement learning, evaluation functions, temporal difference learning, backpropagation nets, robotics, path finding, continuous state space, obstacles

References

  1. C. W. Anderson. Learning and problem solving with multilayer connectionist systems. Technical Report COINS TR 86-50, Dept. of Computer and Information Science, University of Massachusetts, Amherst, MA, 1986.
  2. A. G. Barto. Simulation Experiments with Goal-Seeking Adaptive Elements. COINS, Amherst, Massachusetts, 1984. AFWAL-TR-84-1022.
  3. A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical report, University of Massachusetts, Department of Computer Science, Amherst, MA 01003, August 1991.
  4. A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins. Learning and sequential decision making. Technical Report COINS 89-95, Department of Computer Science, University of Massachusetts, MA, September 1989.
  5. A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins. Learning and sequential decision making. In M. Gabriel and J. W. Moore, editors, Learning and Computational Neuroscience, pages 539–602. MIT Press, Massachusetts, 1990.
  6. R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
  7. P. Dayan. The convergence of TD(λ) for general λ. Machine Learning Journal, 8(3/4), May 1992. Special Issue on Reinforcement Learning.
  8. D. Fox, V. Heinze, K. Möller, and S. B. Thrun. Learning by error driven decomposition. In Proc. of NeuroNimes, France, 1991.
  9. G. E. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185–234, 1989.
  10. R. E. Korf. Real-time heuristic search: New results. In AAAI-88, pages 139–143, 1988.
  11. P. R. Kumar. A survey of some results in stochastic adaptive control. SIAM Journal of Control and Optimization, 23:329–380, 1985.
  12. L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning Journal, 8(3/4), 1992. Special Issue on Reinforcement Learning.
  13. A. Linden. Untersuchung von Backpropagation in konnektionistischen Systemen [Investigation of backpropagation in connectionist systems]. Diplomarbeit, Universität Bonn, Informatik-Institutsbericht Nr. 80, 1990.
  14. J. del R. Millán and C. Torras. A reinforcement connectionist approach to robot path finding in non-maze-like environments. Machine Learning Journal, 8(3/4), May 1992. Special Issue on Reinforcement Learning.
  15. M. Minsky. Steps toward artificial intelligence. In E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, pages 406–450. McGraw-Hill, 1961.
  16. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, Vol. I + II. MIT Press, 1986.
  17. A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, pages 210–229, 1959. Reprinted in E. A. Feigenbaum and J. Feldman (Eds.), 1963, Computers and Thought, McGraw-Hill, New York.
  18. A. L. Samuel. Some studies in machine learning using the game of checkers. II: Recent progress. IBM Journal on Research and Development, pages 601–617, 1967.
  19. F. J. Śmieja and H. Mühlenbein. Reflective modular neural network systems. Technical Report 633, GMD, Sankt Augustin, Germany, February 1992.
  20. F. J. Śmieja. Multiple network systems (MINOS) modules: Task division and module discrimination. In Proceedings of the 8th AISB Conference on Artificial Intelligence, Leeds, 16–19 April 1991.
  21. R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.
  22. R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  23. R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, June 1990.
  24. R. S. Sutton, A. G. Barto, and R. J. Williams. Reinforcement learning is direct adaptive optimal control. In Proceedings of the 1991 American Control Conference, 1991.
  25. G. Tesauro. Practical issues in temporal difference learning. Machine Learning Journal, 8(3/4), 1992. Special Issue on Reinforcement Learning.
  26. S. B. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, Pennsylvania, January 1992.
  27. C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, 1989.
  28. C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning Journal, 8(3/4), May 1992. Special Issue on Reinforcement Learning.
  29. P. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3:179–189, 1990.
  30. P. Werbos and J. Titus. An empirical test of new forecasting methods derived from a theory of intelligence: The prediction of conflict in Latin America. IEEE Transactions on Systems, Man, and Cybernetics, SMC-8:657–666, 1978.
  31. P. J. Werbos. Backpropagation and neurocontrol: A review and prospectus. In Proceedings of IJCNN-89, Washington, pages I-209–216, 1989.
  32. S. D. Whitehead. A study of cooperative mechanisms for faster reinforcement learning. Technical Report 365, University of Rochester, Computer Science Department, Rochester, NY, March 1991.
  33. S. D. Whitehead and D. H. Ballard. Active perception and reinforcement learning. Neural Computation, 2:409–419, 1990.
  34. R. J. Williams. Reinforcement-learning connectionist systems. Technical Report NU-CCS-87-3, College of Computer Science, Northeastern University, Boston, 1987.

Copyright information

© Springer-Verlag Berlin Heidelberg 1993

Authors and Affiliations

  • Alexander Linden (1)
  1. AI Research Division, German National Research Center for Computer Science (GMD), Sankt Augustin, Germany