Abstract
Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods’ relative performance: (1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa’s learning updates are not reliable in the absence of the Markov property and (2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.
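Hypothesis (1) above turns on the form of Sarsa's bootstrapped learning update, which computes its target from the observed next state. As a point of reference, a minimal sketch of that update in Python; the variable names and the toy usage are illustrative, not taken from the paper's implementation:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy TD (Sarsa) update: Q(s,a) += alpha * (TD error).

    If sensor noise aliases states (i.e., the Markov property fails from
    the agent's perspective), the bootstrapped target r + gamma*Q(s',a')
    is computed from an unreliable state estimate, corrupting the update.
    """
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

# Illustrative usage on a toy two-state problem.
Q = defaultdict(float)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
print(Q[(0, 1)])  # 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

By contrast, NEAT never bootstraps from successor states: it evaluates whole policies by their episodic returns, which is why stochasticity in those returns (hypothesis 2) affects it through noisy fitness estimates rather than through corrupted per-step targets.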
References
Albus J. S. (1981) Brains, behavior, and robotics. Byte Books, Peterborough, NH
Anderson, C. W. (1986). Learning and problem solving with multilayer connectionist systems. Ph.D. thesis, University of Massachusetts, Amherst, MA.
Baird, L., & Moore, A. (1999). Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems (Vol. 11). Cambridge, MA: MIT Press.
Bakker, B. (2002). Reinforcement learning with long short-term memory. In Advances in Neural Information Processing Systems (Vol. 14, pp. 1475–1482).
Barto, A., & Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. In Advances in Neural Information Processing Systems (Vol. 6, pp. 687–694).
Barto A. G., Sutton R. S., Anderson C. W. (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13(5): 834–846
Baxter J., Bartlett P. L. (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15: 319–350
Beielstein, T., & Markon, S. (2002). Threshold selection, hypothesis tests, and DOE methods. In Proceedings of the 2002 Congress on Evolutionary Computation (pp. 777–782).
Bellman R. E. (1956) A problem in the sequential design of experiments. Sankhya 16: 221–229
Bellman R. E. (1957) Dynamic programming. Princeton University Press, Princeton
Beyer, H.-G., & Sendhoff, B. (2007). Evolutionary algorithms in the presence of noise: To sample or not to sample. In Proceedings of the 1st IEEE Symposium on Foundations of Computational Intelligence (pp. 17–24).
Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems (Vol. 7).
Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems (Vol. 7, pp. 393–400).
Brafman R. I., Tennenholtz M. (2002) R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3: 213–231
Crites R. H., Barto A. G. (1998) Elevator group control using multiple reinforcement learning agents. Machine Learning 33(2-3): 235–262
Darwen, P. J. (2001). Why co-evolution beats temporal difference learning at backgammon for a linear architecture, but not a non-linear architecture. In Proceedings of the 2001 Congress on Evolutionary Computation (pp. 1003–1010).
Gauci, J. J., & Stanley, K. O. (2007). Generating large-scale neural networks through discovering geometric regularities. In Proceedings of the Genetic and Evolutionary Computation Conference.
Goldberg D. E. (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Boston, MA
Gomez, F., & Miikkulainen, R. (1999). Solving non-Markovian control tasks with neuroevolution. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 1356–1361).
Gomez, F., & Schmidhuber, J. (2005). Co-evolving recurrent neurons learn deep memory POMDPs. In GECCO-05: Proceedings of the Genetic and Evolutionary Computation Conference (pp. 491–498).
Gomez, F., Schmidhuber, J., & Miikkulainen, R. (2006). Efficient non-linear control through neuroevolution. In Proceedings of the European Conference on Machine Learning.
Gruau, F., Whitley, D., & Pyeatt, L. (1996). A comparison between cellular encoding and direct encoding for genetic neural networks. In Genetic Programming 1996: Proceedings of the 1st Annual Conference (pp. 81–89).
Heidrich-Meisner, V., & Igel, C. (2008a). Evolution strategies for direct policy search. In Proceedings of the 10th International Conference on Parallel Problem Solving from Nature (pp. 428–437). Berlin, Heidelberg: Springer.
Heidrich-Meisner, V., & Igel, C. (2008b). Similarities and differences between policy gradient methods and evolution strategies. In Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN).
Heidrich-Meisner, V., & Igel, C. (2008c). Variable metric reinforcement learning methods applied to the noisy mountain car problem. In Recent Advances in Reinforcement Learning: 8th European Workshop (pp. 136–150). Berlin, Heidelberg: Springer.
Jong, N. K., & Stone, P. (2007). Model-based exploration in continuous state spaces. In The 7th Symposium on Abstraction, Reformulation, and Approximation.
Kakade, S. (2003). On the sample complexity of reinforcement learning. Ph.D. thesis, University College London, London, UK.
Kalyanakrishnan, S., & Stone, P. (2009). An empirical analysis of value function-based and policy search reinforcement learning. In Proceedings of the 8th International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2009).
Kassahun, Y., & Sommer, G. (2005). Automatic neural robot controller design using evolutionary acquisition of neural topologies. In Fachgespräch Autonome Mobile Systeme (AMS 2005), Stuttgart, Germany, December 8–9, 2005, Informatik aktuell (Vol. 19, pp. 315–321). Springer.
Kearns M., Singh S. (2002) Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2): 209–232
Keller, P., Mannor, S., & Precup, D.(2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (pp. 449–456).
Kohl, N., & Miikkulainen, R. (2008). Evolving neural networks for fractured domains. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 1405–1412).
Kohl N., Miikkulainen R. (2009) Evolving neural networks for strategic decision-making problems. Neural Networks, Special Issue on Goal-Directed Neural Systems 22(3): 326–337
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 2619–2624).
Kretchmar, R. M., & Anderson, C. W. (1997). Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning. In International Conference on Neural Networks.
Lagoudakis M. G., Parr R. (2003) Least-squares policy iteration. Journal of Machine Learning Research 4: 1107–1149
Littman, M. L., Dean, T. L., & Kaelbling, L. P. (1995). On the complexity of solving Markov decision processes. In Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence (pp. 394–402).
Lucas, S. M., & Runarsson, T. P. (2006). Temporal difference learning versus co-evolution for acquiring Othello position evaluation. In IEEE Symposium on Computational Intelligence and Games.
Lucas, S. M., & Togelius, J. (2007). Point-to-point car racing: An initial study of evolution versus temporal difference learning. In IEEE Symposium on Computational Intelligence and Games (pp. 260–267).
Mahadevan, S. (2005). Samuel meets Amarel: Automating value function approximation using global state space analysis. In Proceedings of the 20th National Conference on Artificial Intelligence.
Mannor, S., Rubenstein, R., & Gat, Y. (2003). The cross-entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (pp. 512–519).
Menache I., Mannor S., Shimkin N. (2005) Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research 134: 215–238
Metzen, J. H., Edgington, M., Kassahun, Y., & Kirchner, F. (2008). Analysis of an evolutionary reinforcement learning method in a multiagent domain. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2008) (pp. 291–298). Estoril, Portugal.
Moore A., Atkeson C. (1993) Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning 13: 103–130
Moriarty D. E., Miikkulainen R. (1996) Efficient reinforcement learning through symbiotic evolution. Machine Learning 22(1–3): 11–32
Moriarty D. E., Schultz A. C., Grefenstette J. J. (1999) Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research 11: 241–276
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (2004). Inverted autonomous helicopter flight via reinforcement learning. In Proceedings of the International Symposium on Experimental Robotics.
Noda I., Matsubara H., Hiraki K., Frank I. (1998) Soccer server: A tool for research on multiagent systems. Applied Artificial Intelligence 12: 233–250
Pollack J., Blair A. (1998) Co-evolution in the successful learning of backgammon strategy. Machine Learning 32: 225–240
Potter M. A., De Jong K. A. (2000) Cooperative coevolution: An architecture for evolving coadapted subcomponents. Evolutionary Computation 8: 1–29
Powell M. (1987) Radial basis functions for multivariate interpolation: A review. In Algorithms for approximation. Clarendon Press, Oxford
Pyeatt, L. D., & Howe, A. E. (2001). Decision tree function approximation in reinforcement learning. In Proceedings of the 3rd International Symposium on Adaptive Systems: Evolutionary computation and probabilistic graphical models (pp. 70–77).
Radcliffe N. J. (1993) Genetic set recombination and its application to neural network topology optimization. Neural Computing and Applications 1(1): 67–90
Riedmiller, M. (2005). Neural fitted Q iteration—first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning (pp. 317–328).
Rummery, G., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University.
Runarsson T. P., Lucas S. M. (2005) Co-evolution versus self-play temporal difference learning for acquiring position evaluation in small-board Go. IEEE Transactions on Evolutionary Computation 9: 628–640
Saravanan N., Fogel D. B. (1995) Evolving neural control systems. IEEE Expert: Intelligent Systems and Their Applications 10(3): 23–27
Smart, W. D., & Kaelbling, L. P. (2000). Practical reinforcement learning in continuous spaces. In Proceedings of the 17th International Conference on Machine Learning (pp. 903–910).
Stagge P. (1998) Averaging efficiently in the presence of noise. Parallel Problem Solving from Nature 5: 188–197
Stanley K. O., Miikkulainen R. (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10(2): 99–127
Stanley K. O., Miikkulainen R. (2004) Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research 21: 63–100
Stone P. (2000) Layered learning in multiagent systems: A winning approach to robotic soccer. MIT Press, Cambridge, MA
Stone, P., Kuhlmann, G., Taylor, M. E., & Liu, Y. (2005a). Keepaway soccer: From machine learning testbed to benchmark. In RoboCup-2005: Robot Soccer World Cup IX (Vol. 4020, pp. 93–105). Berlin: Springer.
Stone P., Sutton R. S., Kuhlmann G. (2005b) Learning in RoboCup-soccer keepaway. Adaptive Behavior 13(3): 165–188
Strehl, A., & Littman, M. (2005). A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning (pp. 856–863).
Sutton, R. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems (Vol. 8, pp. 1038–1044).
Sutton R. S. (1988) Learning to predict by the methods of temporal differences. Machine Learning 3: 9–44
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning (pp. 216–224).
Sutton R. S., Barto A. G. (1998) Reinforcement learning: An introduction. MIT Press, Cambridge, MA
Sutton, R., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (pp. 1057–1063).
Szita I., Lörincz A. (2006) Learning Tetris using the noisy cross-entropy method. Neural Computation 18(12): 2936–2941
Taylor, M. E., Whiteson, S., & Stone, P. (2006). Comparing evolutionary and temporal difference methods in a reinforcement learning domain. In GECCO 2006: Proceedings of the Genetic and Evolutionary Computation Conference (pp. 1321–1328).
Tesauro G. (1994) TD-gammon, a self-teaching backgammon program achieves master-level play. Neural Computation 6: 215–219
Tesauro G. (1998) Comments on “co-evolution in the successful learning of backgammon strategy”. Machine Learning 32(3): 241–243
Tesauro, G., Das, N. K. J. R., & Bennania, M. N. (2006). A hybrid reinforcement learning approach to autonomic resource allocation. In Proceedings of the 3rd International Conference on Autonomic Computing.
Watkins C., Dayan P. (1992) Q-learning. Machine Learning 8(3-4): 279–292
Wieland, A. (1991). Evolving neural network controllers for unstable systems. In International Joint Conference on Neural Networks (pp. 667–673).
Whiteson S., Kohl N., Miikkulainen R., Stone P. (2005) Evolving keepaway soccer players through task decomposition. Machine Learning 59(1): 5–30
Whiteson S., Stone P. (2006) Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research 7: 877–917
Whitley D., Dominic S., Das R., Anderson C. W. (1993) Genetic reinforcement learning for neurocontrol problems. Machine Learning 13: 259–284
Whitley, D., & Kauth, K. (1988). GENITOR: A different genetic algorithm. In Proceedings of the 1988 Rocky Mountain Conference on Artificial Intelligence (pp. 118–130).
Yao X. (1999) Evolving artificial neural networks. Proceedings of the IEEE 87(9): 1423–1447
Acknowledgements
We would like to thank Ken Stanley for help setting up NEAT in keepaway, as well as Shivaram Kalyanakrishnan, Nate Kohl, Frans Oliehoek, David Pardoe, Jefferson Provost, Joseph Reisinger, Ken Stanley, and the anonymous reviewers for helpful comments and suggestions. This research was supported in part by NSF CAREER award IIS-0237699 and NSF award EIA-0303609.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Additional information
This paper significantly extends an earlier conference paper, presented at the 2006 GECCO conference [72].
Cite this article
Whiteson, S., Taylor, M.E. & Stone, P. Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning. Auton Agent Multi-Agent Syst 21, 1–35 (2010). https://doi.org/10.1007/s10458-009-9100-2