A Theoretical Analysis of Temporal Difference Learning in the Iterated Prisoner’s Dilemma Game

Masuda, Naoki; Ohtsuki, Hisashi

doi:10.1007/s11538-009-9424-8

Original Article
Published: 29 May 2009

Volume 71, pages 1818–1850, (2009)

Bulletin of Mathematical Biology

Naoki Masuda¹ &
Hisashi Ohtsuki²

Abstract

Direct reciprocity is a chief mechanism of mutual cooperation in social dilemmas. Agents cooperate if future interactions with the same opponents are highly likely. Direct reciprocity has been explored mostly by evolutionary game theory based on natural selection. Our daily experience tells us, however, that real social agents, including humans, learn to cooperate based on experience. In this paper, we analyze a reinforcement learning model called temporal difference learning and study its performance in the iterated Prisoner’s Dilemma game. Temporal difference learning is unique among a variety of learning models in that it inherently aims at increasing future payoffs, not immediate ones. It also has a neural basis. We analytically and numerically show that learners with only two internal states properly learn to cooperate with retaliatory players and to defect against unconditional cooperators and defectors. Four-state learners are more capable of achieving a high payoff against various opponents. Moreover, we numerically show that four-state learners can learn to establish mutual cooperation for sufficiently small learning rates.
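
As an illustration of the setting the abstract describes, the following is a minimal sketch of a two-state temporal difference learner in the iterated Prisoner’s Dilemma. It is not the authors' model: it uses tabular Q-learning (one member of the temporal difference family) with epsilon-greedy exploration, and everything the abstract leaves unspecified is an assumption of the sketch. In particular, the payoff values (5, 3, 1, 0), the choice of the learner's own previous move as its internal state, the discount factor `gamma`, the learning rate `alpha`, the exploration rate `epsilon`, and the helper names `opponent_move`, `greedy`, and `train` are all hypothetical.

```python
import random

# Assumed payoffs to the learner, keyed by (learner move, opponent move).
# The abstract does not give values; these satisfy T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
ACTIONS = ("C", "D")


def opponent_move(kind, learner_prev):
    """Fixed opponent: tit-for-tat copies the learner's previous move,
    ALLC always cooperates, ALLD always defects."""
    if kind == "TFT":
        return learner_prev
    return "C" if kind == "ALLC" else "D"


def greedy(q, state):
    """Action with the highest estimated value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(kind, steps=100_000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Run Q-learning against one fixed opponent and return the Q-table.

    The learner's state is its own previous move (an assumption of this
    sketch), so there are exactly two internal states.
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in ACTIONS for a in ACTIONS}
    state = "C"  # start as if the learner had just cooperated
    for _ in range(steps):
        # Epsilon-greedy: explore occasionally, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = greedy(q, state)
        reward = PAYOFF[(action, opponent_move(kind, state))]
        next_state = action
        # TD update: the target includes the discounted future value,
        # so the learner weighs future payoffs, not only the immediate one.
        target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = next_state
    return q


if __name__ == "__main__":
    for kind in ("TFT", "ALLC", "ALLD"):
        q = train(kind)
        print(kind, {s: greedy(q, s) for s in ACTIONS})
```

Under these assumed settings, the learned greedy policy should cooperate against tit-for-tat and defect against the unconditional cooperator and defector, which is the qualitative behavior the abstract reports for two-state learners. The discount factor matters here: with the assumed payoffs, a one-time gain from defecting against tit-for-tat is outweighed by the lost future payoff only when `gamma` is large enough.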

Author information

Authors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, 113-8656, Japan
Naoki Masuda
Department of Value and Decision Science, Tokyo Institute of Technology, Tokyo, 152-8552, Japan
Hisashi Ohtsuki

Corresponding author

Correspondence to Naoki Masuda.

About this article

Cite this article

Masuda, N., Ohtsuki, H. A Theoretical Analysis of Temporal Difference Learning in the Iterated Prisoner’s Dilemma Game. Bull. Math. Biol. 71, 1818–1850 (2009). https://doi.org/10.1007/s11538-009-9424-8

Received: 30 August 2008
Accepted: 08 April 2009
Published: 29 May 2009
Issue Date: November 2009
DOI: https://doi.org/10.1007/s11538-009-9424-8
