Universal parameter optimisation in games based on SPSA
Abstract
Most game programs have a large number of parameters that are crucial for their performance. While tuning these parameters by hand is rather difficult, efficient, easy-to-use, generic automatic parameter optimisation algorithms are known only for special problems such as the adjustment of the parameters of an evaluation function. The SPSA algorithm (Simultaneous Perturbation Stochastic Approximation) is a generic stochastic gradient method for optimising an objective function when an analytic expression of the gradient is not available, a frequent case in game programs. Further, SPSA in its canonical form is very easy to implement. Its generality and simplicity make it an attractive choice for parameter optimisation in game programs. The goal of this paper is twofold: (i) to introduce SPSA to the game programming community by putting it into a game-programming perspective, and (ii) to propose and discuss several methods that can be used to enhance the performance of SPSA. These methods include using common random numbers and antithetic variables, a combination of SPSA with RPROP, and the reuse of samples of previous performance evaluations. SPSA with the proposed enhancements was tested in large-scale experiments on tuning the parameters of an opponent model, a policy and an evaluation function in our poker program, MCRAISE. Whilst SPSA with no enhancements failed to make progress using the allocated resources, SPSA with the enhancements proved to be competitive with other methods, including TD-learning, increasing the average payoff per game by as much as 0.19 times the size of the small bet. From the experimental study, we conclude that the use of an appropriately enhanced variant of SPSA for the optimisation of game program parameters is a viable approach, especially if no good alternative exists for the types of parameters considered.
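To make the idea concrete, the following is a minimal sketch of canonical SPSA used for maximisation, combined with common random numbers. It is an illustration under stated assumptions, not the implementation used in MCRAISE: the names `spsa_maximise` and `noisy_payoff`, the toy quadratic objective, and all default constants are hypothetical. The gain sequences use the decaying form and the exponents 0.602 and 0.101 that are commonly recommended for SPSA in practice. RPROP, antithetic variables and sample reuse are not shown.

```python
import random


def spsa_maximise(f, theta, iterations=300, a=0.5, c=0.1, A=10.0,
                  alpha=0.602, gamma=0.101, seed=0):
    """Canonical SPSA gradient ascent (illustrative sketch).

    f(theta, seed) returns a noisy performance estimate; passing the same
    seed to both evaluations implements common random numbers.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1 + A) ** alpha   # step-size gain
        c_k = c / (k + 1) ** gamma       # perturbation size
        # Perturb all coordinates at once with random +/-1 signs.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Common random numbers: both evaluations share one seed.
        crn_seed = rng.getrandbits(32)
        diff = f(plus, crn_seed) - f(minus, crn_seed)
        # Two evaluations yield an estimate of every partial derivative.
        theta = [t + a_k * diff / (2.0 * c_k * d)
                 for t, d in zip(theta, delta)]
    return theta


def noisy_payoff(theta, seed):
    """Toy stand-in for a game simulation (assumption): a noisy
    quadratic whose expected value peaks at (1, -2)."""
    noise = random.Random(seed).gauss(0.0, 0.1)
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2) + noise


if __name__ == "__main__":
    print(spsa_maximise(noisy_payoff, [0.0, 0.0]))
```

Each iteration needs only two evaluations of the objective regardless of the number of parameters, which is what makes the method attractive when each evaluation is an expensive batch of games. In this toy example the additive noise cancels exactly because both evaluations share a seed; in a real game program, common random numbers would typically reduce the variance of the difference rather than remove it.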
Keywords
SPSA, Stochastic gradient ascent, Games, Learning, Poker