Learning and exploitation do not conflict under minimax optimality

  • Csaba Szepesvári
Part II: Regular Papers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1224)


We show that adaptive real-time dynamic programming, extended with an action-selection strategy that always chooses the best action according to the latest estimate of the cost function, yields asymptotically optimal policies within finite time under the minimax optimality criterion. From this it follows that learning and exploitation do not conflict under this special optimality criterion. We relate this result to learning optimal strategies in repeated two-player zero-sum deterministic games.
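A minimal sketch of the construction, in Python, may help picture it (the class name MinimaxARTDP, the discount factor GAMMA, the tie-breaking rule, and the empirical outcome sets are illustrative assumptions, not taken from the paper): action selection is pure exploitation of the latest cost estimates, and learning consists solely of worst-case (minimax) Bellman backups at the visited state-action pairs.

    from collections import defaultdict

    # Discount factor: an assumption of this sketch, not a value fixed by the paper.
    GAMMA = 0.9

    class MinimaxARTDP:
        """Adaptive real-time dynamic programming with greedy ("exploiting")
        action selection under the minimax (worst-case) criterion.
        Illustrative sketch only, not the paper's exact construction."""

        def __init__(self, actions):
            self.actions = list(actions)
            self.Q = defaultdict(float)       # latest cost estimates, initialized to 0
            self.outcomes = defaultdict(set)  # (x, a) -> set of observed (cost, y) pairs

        def greedy_action(self, x):
            # Exploit: choose the action that looks best under the latest estimate.
            return min(self.actions, key=lambda a: self.Q[(x, a)])

        def observe(self, x, a, cost, y):
            # Learn: record the new outcome in the empirical model, then perform
            # a worst-case (minimax) Bellman backup at the visited pair (x, a).
            self.outcomes[(x, a)].add((cost, y))
            self.Q[(x, a)] = max(
                c + GAMMA * min(self.Q[(y2, b)] for b in self.actions)
                for (c, y2) in self.outcomes[(x, a)]
            )

The point of the sketch is the shape of the interaction: no separate exploration mechanism is used, and estimates change only through minimax backups along the trajectory actually followed.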


reinforcement learning · self-optimizing systems · dynamic games



Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • Csaba Szepesvári
  1. Research Group on Artificial Intelligence, “József Attila” University, Szeged, Hungary
