Machine Learning, Volume 63, Issue 3, pp 249–286

Universal parameter optimisation in games based on SPSA

  • Levente Kocsis
  • Csaba Szepesvári


Most game programs have a large number of parameters that are crucial for their performance. While tuning these parameters by hand is difficult, efficient and easy-to-use generic algorithms for automatic parameter optimisation are known only for special problems, such as adjusting the parameters of an evaluation function. The SPSA algorithm (Simultaneous Perturbation Stochastic Approximation) is a generic stochastic gradient method for optimising an objective function when no analytic expression of the gradient is available, a frequent case in game programs. Further, SPSA in its canonical form is very easy to implement. Its generality and simplicity make it an attractive choice for parameter optimisation in game programs. The goal of this paper is twofold: (i) to introduce SPSA to the game programming community by putting it into a game-programming perspective, and (ii) to propose and discuss several methods that can be used to enhance the performance of SPSA. These methods include using common random numbers and antithetic variables, a combination of SPSA with RPROP, and the reuse of samples from previous performance evaluations. SPSA with the proposed enhancements was tested in large-scale experiments on tuning the parameters of an opponent model, a policy and an evaluation function in our poker program, MCRAISE. Whilst SPSA with no enhancements failed to make progress using the allocated resources, SPSA with the enhancements proved to be competitive with other methods, including TD-learning, increasing the average payoff per game by as much as 0.19 times the size of the small bet. From the experimental study, we conclude that using an appropriately enhanced variant of SPSA to optimise game program parameters is a viable approach, especially if no good alternative exists for the types of parameters considered.
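To make the idea concrete, the following is a minimal sketch of canonical SPSA combined with common random numbers. It is an illustration under stated assumptions, not the paper's implementation. The names `spsa_maximise` and `noisy_payoff`, the toy quadratic objective, and all constants are hypothetical. The gain exponents 0.602 and 0.101 are conventional choices from the SPSA literature. Each iteration perturbs all parameters simultaneously with random signs and estimates the gradient from only two evaluations of the objective, whatever the number of parameters.

```python
import random


def spsa_maximise(objective, theta, iterations=500, a=0.2, c=0.1,
                  A=10.0, alpha=0.602, gamma=0.101, seed=0):
    """Canonical SPSA (gradient ascent) with common random numbers.

    `objective(params, rng)` returns a noisy performance estimate and
    draws all of its randomness from `rng`. All defaults are assumptions
    chosen for the toy problem below.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1 + A) ** alpha   # decaying step size
        c_k = c / (k + 1) ** gamma       # decaying perturbation size
        # Perturb every coordinate at once with independent random signs.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Common random numbers: both evaluations see the same noise stream.
        noise_seed = rng.getrandbits(32)
        diff = (objective(plus, random.Random(noise_seed))
                - objective(minus, random.Random(noise_seed)))
        # One scalar difference yields an estimate of every partial derivative.
        theta = [t + a_k * diff / (2.0 * c_k * d)
                 for t, d in zip(theta, delta)]
    return theta


def noisy_payoff(params, rng):
    """Hypothetical stand-in for 'average payoff over simulated games'."""
    target = (1.0, -2.0)
    value = -sum((p - t) ** 2 for p, t in zip(params, target))
    return value + rng.gauss(0.0, 1.0)   # additive evaluation noise


if __name__ == "__main__":
    print(spsa_maximise(noisy_payoff, [0.0, 0.0]))
```

In this toy problem the noise is additive, so sharing the seed cancels it almost exactly in the difference. In a real game program, common random numbers (for example, replaying the same card deals for both perturbed parameter vectors) would typically reduce the variance of the difference rather than eliminate it.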


Keywords: SPSA, Stochastic gradient ascent, Games, Learning, Poker



Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. MTA SZTAKI, Budapest, Hungary
