Most game programs have a large number of parameters that are crucial for their performance. While tuning these parameters by hand is rather difficult, efficient, easy-to-use, generic algorithms for automatic parameter optimisation are known only for special problems such as the adjustment of the parameters of an evaluation function. The SPSA algorithm (Simultaneous Perturbation Stochastic Approximation) is a generic stochastic gradient method for optimising an objective function when an analytic expression of the gradient is not available, a frequent case in game programs. Further, SPSA in its canonical form is very easy to implement. As such, it is an attractive choice for parameter optimisation in game programs, due to both its generality and its simplicity. The goal of this paper is twofold: (i) to introduce SPSA to the game programming community by putting it into a game-programming perspective, and (ii) to propose and discuss several methods that can be used to enhance the performance of SPSA. These methods include using common random numbers and antithetic variables, a combination of SPSA with RPROP, and the reuse of samples of previous performance evaluations. SPSA with the proposed enhancements was tested in some large-scale experiments on tuning the parameters of an opponent model, a policy and an evaluation function in our poker program, MCRAISE. Whilst SPSA with no enhancements failed to make progress using the allocated resources, SPSA with the enhancements proved to be competitive with other methods, including TD-learning, increasing the average payoff per game by as much as 0.19 times the small bet. From the experimental study, we conclude that the use of an appropriately enhanced variant of SPSA for the optimisation of game program parameters is a viable approach, especially if no good alternative exists for the types of parameters considered.
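To make the idea of a gradient-free stochastic gradient method concrete, the following is a minimal sketch of SPSA in its canonical form, written for maximisation. It is not the enhanced variant evaluated in the paper: the function name `spsa_maximise`, the gain constants `a` and `c`, and the toy objective are assumptions chosen for illustration. The decay exponents 0.602 and 0.101 are values commonly suggested in the SPSA literature. In a game program, `f` would typically return a noisy performance estimate, such as the average payoff over a batch of games.

```python
import random


def spsa_maximise(f, theta, iterations=1000, a=0.1, c=0.1, seed=0):
    """Canonical-form SPSA sketch (illustrative, hypothetical defaults).

    Each iteration perturbs all parameters simultaneously and uses only
    two evaluations of f, regardless of the number of parameters.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, iterations + 1):
        a_k = a / k ** 0.602  # step-size sequence
        c_k = c / k ** 0.101  # perturbation-size sequence
        # Random +/-1 perturbation direction (Bernoulli distribution).
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = f(plus) - f(minus)
        # Gradient estimate per coordinate: diff / (2 * c_k * delta_i).
        # Ascent step, since the objective is being maximised.
        theta = [t + a_k * diff / (2.0 * c_k * d)
                 for t, d in zip(theta, delta)]
    return theta


def toy_objective(x):
    """Hypothetical noiseless objective with its maximum at (1, -2)."""
    return -((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)


best = spsa_maximise(toy_objective, [0.0, 0.0])
```

The sketch uses a deterministic objective so that the behaviour is easy to check. With a noisy objective, both evaluations in an iteration could be driven by the same random seed, which is the basic idea behind the common random numbers mentioned in the abstract.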
Editors: Michael Bowling · Johannes Fürnkranz · Thore Graepel · Ron Musick
Cite this article
Kocsis, L., Szepesvári, C. Universal parameter optimisation in games based on SPSA. Mach Learn 63, 249–286 (2006). https://doi.org/10.1007/s10994-006-6888-8
- Stochastic gradient ascent