Universal parameter optimisation in games based on SPSA

Abstract

Most game programs have a large number of parameters that are crucial for their performance. While tuning these parameters by hand is rather difficult, efficient and easy-to-use generic automatic parameter optimisation algorithms are known only for special problems such as the adjustment of the parameters of an evaluation function. The SPSA algorithm (Simultaneous Perturbation Stochastic Approximation) is a generic stochastic gradient method for optimising an objective function when an analytic expression of the gradient is not available, a frequent case in game programs. Further, SPSA in its canonical form is very easy to implement. As such, it is an attractive choice for parameter optimisation in game programs, due to both its generality and its simplicity. The goal of this paper is twofold: (i) to introduce SPSA to the game programming community by putting it into a game-programming perspective, and (ii) to propose and discuss several methods that can be used to enhance the performance of SPSA. These methods include using common random numbers and antithetic variables, a combination of SPSA with RPROP, and the reuse of samples of previous performance evaluations. SPSA with the proposed enhancements was tested in some large-scale experiments on tuning the parameters of an opponent model, a policy and an evaluation function in our poker program, MCRAISE. Whilst SPSA with no enhancements failed to make progress using the allocated resources, SPSA with the enhancements proved to be competitive with other methods, including TD-learning, increasing the average payoff per game by as much as 0.19 times the size of the small bet. From the experimental study, we conclude that the use of an appropriately enhanced variant of SPSA for the optimisation of game program parameters is a viable approach, especially if no good alternative exists for the types of parameters considered.
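
To make the idea concrete, the following is a minimal sketch of SPSA in its canonical form, combined with common random numbers. It is an illustration under stated assumptions, not the implementation used in the paper: the names `spsa_maximise` and `toy_payoff` are hypothetical, the gain constants are commonly used defaults rather than values taken from the experiments, and the toy payoff merely stands in for the average result of a batch of games.

```python
import random


def spsa_maximise(payoff, theta, iterations=2000, a=0.1, c=0.1,
                  A=10.0, alpha=0.602, gamma=0.101, seed=0):
    """Maximise a noisy payoff(theta, seed) with canonical SPSA (sketch)."""
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1 + A) ** alpha  # step size, decays with k
        c_k = c / (k + 1) ** gamma      # perturbation size, decays with k
        # Perturb all parameters simultaneously with independent +/-1 draws.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Common random numbers (assumption: the payoff accepts a seed),
        # so both evaluations see the same random events.
        game_seed = rng.getrandbits(32)
        diff = payoff(plus, game_seed) - payoff(minus, game_seed)
        # Two evaluations yield an estimate of every gradient component;
        # the plus sign makes this an ascent step.
        theta = [t + a_k * diff / (2.0 * c_k * d)
                 for t, d in zip(theta, delta)]
    return theta


def toy_payoff(theta, seed):
    """Hypothetical stand-in for game results: noisy concave payoff."""
    noise = random.Random(seed).gauss(0.0, 0.1)
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2) + noise


tuned = spsa_maximise(toy_payoff, [0.0, 0.0])
```

Each iteration needs only two payoff evaluations regardless of the number of parameters, which is what makes the method attractive when every evaluation means playing games. In this toy setting the shared seed cancels the additive noise exactly; in a real game program the cancellation would be only partial.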

Author information

Corresponding author

Correspondence to Csaba Szepesvári.

Additional information

Editors: Michael Bowling · Johannes Fürnkranz · Thore Graepel · Ron Musick

About this article

Cite this article

Kocsis, L., Szepesvári, C. Universal parameter optimisation in games based on SPSA. Mach Learn 63, 249–286 (2006). https://doi.org/10.1007/s10994-006-6888-8

Keywords

  • SPSA
  • Stochastic gradient ascent
  • Games
  • Learning
  • Poker