Bulletin of Mathematical Biology, Volume 71, Issue 8, pp 1818–1850

A Theoretical Analysis of Temporal Difference Learning in the Iterated Prisoner’s Dilemma Game

  • Naoki Masuda
  • Hisashi Ohtsuki
Original Article


Direct reciprocity is a chief mechanism of mutual cooperation in social dilemmas. Agents cooperate if future interactions with the same opponents are highly likely. Direct reciprocity has been explored mostly by evolutionary game theory based on natural selection. Daily experience suggests, however, that real social agents, including humans, learn to cooperate from experience. In this paper, we analyze a reinforcement learning model called temporal difference learning and study its performance in the iterated Prisoner’s Dilemma game. Temporal difference learning is unique among learning models in that it inherently aims at increasing future payoffs rather than immediate ones. It also has a neural basis. We show analytically and numerically that learners with only two internal states properly learn to cooperate with retaliatory players and to defect against unconditional cooperators and defectors. Four-state learners are more capable of achieving a high payoff against various opponents. Moreover, we show numerically that four-state learners can learn to establish mutual cooperation when the learning rate is sufficiently small.
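
The two-state result described above can be illustrated with a minimal sketch. It is not the model analyzed in the paper: it uses tabular Q-learning (one member of the temporal difference family) with epsilon-greedy exploration, and it assumes that the learner's internal state is its own previous action. The payoff values (T=5, R=3, P=1, S=0), the discount factor, the learning rate, the exploration rate, and all names (`train`, `greedy_policy`, `OPPONENTS`) are illustrative assumptions.

```python
import random

C, D = 0, 1  # cooperate, defect

# Learner's payoff indexed by (own action, opponent action).
# Assumed standard values: T=5, R=3, P=1, S=0.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

# Each opponent maps the learner's previous action to its own current move.
OPPONENTS = {
    "tit_for_tat": lambda prev: prev,  # retaliatory player
    "all_c": lambda prev: C,           # unconditional cooperator
    "all_d": lambda prev: D,           # unconditional defector
}


def train(opponent, steps=50_000, alpha=0.2, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; the state is the learner's own previous action."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    state = C  # assume the game starts as if the learner had cooperated
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            action = rng.choice((C, D))
        else:
            action = C if q[state][C] >= q[state][D] else D
        reward = PAYOFF[(action, opponent(state))]
        next_state = action
        # Temporal difference update toward reward plus discounted future value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Preferred action in each state: [after own C, after own D]."""
    return ["C" if row[C] >= row[D] else "D" for row in q]


for name, opponent in OPPONENTS.items():
    print(name, greedy_policy(train(opponent)))
```

Under these assumptions the discount factor of 0.9 makes sustained mutual cooperation with tit-for-tat worth more than a one-off temptation payoff, so the learned greedy policy should cooperate against tit-for-tat and defect against both unconditional players. A lower discount factor would shift that balance toward defection.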


Cooperation, Direct reciprocity, Prisoner’s dilemma, Reinforcement learning





Copyright information

© Society for Mathematical Biology 2009

Authors and Affiliations

  1. Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo, Japan
  2. Department of Value and Decision Science, Tokyo Institute of Technology, Tokyo, Japan
