## Abstract

One approach to achieving cooperation in the one-shot prisoner’s dilemma is Tennenholtz’s (Games Econ Behav 49(2):363–373, 2004) program equilibrium, in which the players of a game submit programs instead of strategies. These programs are then allowed to read each other’s source code to decide which action to take. As shown by Tennenholtz, cooperation is played in an equilibrium of this alternative game. In particular, he proposes that the two players submit the same version of the following program: cooperate if the opponent is an exact copy of this program and defect otherwise. Neither of the two players can benefit from submitting a different program. Unfortunately, this equilibrium is fragile and unlikely to be realized in practice. We thus propose a new, simple program to achieve more robust cooperative program equilibria: cooperate with some small probability \(\epsilon \) and otherwise act as the opponent acts against this program. I argue that this program is similar to the tit for tat strategy for the iterated prisoner’s dilemma. Both “start” by cooperating and copy their opponent’s behavior from “the last round”. We then generalize this approach of turning strategies for the repeated version of a game into programs for the one-shot version of a game to other two-player games. We prove that the resulting programs inherit properties of the underlying strategy. This enables them to robustly and effectively elicit the same responses as the underlying strategy for the repeated game.

## Keywords

Algorithmic game theory Program equilibrium Nash equilibrium Repeated games## 1 Introduction

Much has been written about rationalizing non-Nash equilibrium play in strategic-form games. Most prominently, game theorists have discussed how cooperation may be achieved in the prisoner’s dilemma, where mutual cooperation is not a Nash equilibrium but Pareto-superior to mutual defection. One of the most successful approaches is the repetition of a game, and in particular the iterated prisoner’s dilemma (Axelrod 2006).

Another approach is to introduce commitment mechanisms of some sort. In this paper, we will discuss one particular commitment mechanism: Tennenholtz’s (2004) program equilibrium formalism (Sect. 2.2). Here, the idea is that in place of strategies, players submit programs which compute strategies and are given access to each other’s source code. The programs can then encode credible commitments, such as some version of “if you cooperate, I will cooperate”.

As desired, Tennenholtz (2004, Sect. 3, Theorem 1) shows that mutual cooperation is played in a program equilibrium of the prisoner’s dilemma. However, Tennenholtz’ equilibrium is very fragile. Essentially, it consists of two copies of a program that cooperates if it faces an exact copy of itself (cf. McAfee 1984; Howard 1988). Even small deviations from that program break the equilibrium. Thus, achieving cooperation in this way is only realistic if the players can communicate beforehand and settle on a particular outcome.

Another persuasive critique of this trivial equilibrium is that the model of two players submitting programs is only a metaphor, anyway. In real life, the programs may instead be the result of an evolutionary process (Binmore 1988 pp. 14f.) and Tennenholtz’ equilibrium is a hard to obtain by such a process. Alternatively, if we view our theory as normative rather than descriptive, we may view the programs themselves as the target audience of our recommendations. This also means that these agents will already have some form of source code—e.g., one that derives and considers the program equilibria of the game—and it is out of their realm of power to change that source code to match some common standard. However, they may still decide on some procedure for thinking about this particular problem in such a way that enables cooperation with other rationally pre-programmed agents.

Noting the fragility of Tennenholtz’ proposed equilibrium, it has been proposed to achieve a more robust program equilibrium by letting the programs reason about each other (van der Hoek et al. 2013; Barasz et al. 2014; Critch 2016). For example, Barasz et al. (2014, Sect. 3) propose a program FairBot—variations of which we will see in this paper—that cooperates if Peano arithmetic can prove that the opponent cooperates against FairBot. FairBot cooperates (via Löb’s theorem) more robustly against different versions of itself. These proposals are very elegant and certainly deserve further attention. However, their benefits come at the cost of being computationally expensive.

In this paper, I thus derive a class of program equilibria that I will argue to be more practical. In the case of the prisoner’s dilemma, I propose a program that cooperates with a small probability and otherwise acts as the opponent acts against itself (see Algorithm 1). Doing what the opponent does—à la FairBot—incentivizes cooperation. Cooperating with a small probability allows us to avoid infinite loops that would arise if we merely predicted and copied our opponent’s action (see Algorithm 2). This approach to a robust cooperation program equilibrium in the prisoner’s dilemma is described in Sect. 3.

We then go on to generalize the construction exemplified in the prisoner’s dilemma (see Sect. 4). In particular, we show how strategies for the repeated version of a game can be used to construct good programs for the one-shot version of that game. We show that many of the properties of the underlying strategy of the repeated game carry over to the program for the stage game. We can thus construct “good” programs and program equilibria from “good” strategies and Nash equilibria.

## 2 Preliminaries

### 2.1 Strategic-form games

For reference, we begin by introducing some basic terminology and formalism for strategic-form games. For an introduction, see, e.g., Osborne (2004). For reasons that will become apparent later on, we limit our treatment to two-player games.

A *two-player strategic game* \(G=(A_1,A_2,u_1,u_2)\) consists of two countable sets of moves \(A_i\) and for both players \(i\in \{1,2\}\) a bounded utility function \(u_i:A_1\times A_2 \rightarrow {\mathbb {R}}\). A *(mixed) strategy* for player *i* is a probability distribution \(\pi _i\) over \(A_i\).

*strategy profile*\((\pi _1,\pi _2)\) the probability of an

*outcome*\((a_1,a_2)\in A_1\times A_2\) is

*i*given that strategy profile is

### 2.2 Program equilibrium

We now introduce the concept of program equilibrium, first proposed by Tennenholtz (2004). The main idea is to replace strategies with computer programs that are given access to each other’s source code.^{1} The programs then give rise to strategies.

For any game *G*, we first need to define the set of *program profiles* \( PROG (G)\) consisting of pairs of programs. The *i*th entry of an element of \( PROG (G)\) must be a program source code \(p_i\) that, when interpreted by a function \( apply \), probabilistically map program profiles^{2} onto \(A_i\).

We require that for any program profile \((p_1,p_2)\in PROG (G)\), both programs halt. Otherwise, the profile would not give rise to a well-defined strategy. Whether \(p_i\) halts depends on the program \(p_{-i}\), it plays against, where (in accordance with convention in game theory) \(-:\{1,2\}\rightarrow \{1,2\}:1\mapsto 2, 2 \mapsto 1\) and we write \(-i\) instead of \(-(i)\). For example, if \(p_{i}\) runs \( apply (p_{-i},(p_i, p_{-i}))\), i.e., simulates the opponent, then that is fine as long as \(p_{-i}\) does not also run \(apply(p_i,(p_i,p_{-i}))\), which would yield an infinite loop. To avoid this mutual dependence, we will generally require that \( PROG (G)= PROG _1(G)\times PROG _2(G)\), where \( PROG _i(G)\) consists of programs for player *i*. Methods of doing this while maintaining expressive power include hierarchies of players—e.g., higher indexed players are allowed to simulate lower indexed ones but not vice versa—hierarchies of programs—programs can only call their opponents with simpler programs as input—requiring programs to have a “plan B” if termination can otherwise not be guaranteed, or allowing each player to only start strictly less than one simulation in expectation. These methods may also be combined. In this paper, we do not assume any particular definition of \( PROG (G)\). However, we assume that they can perform arbitrary computations as long as these computations are guaranteed to halt regardless of the output of the parts of the code that do depend on the opponent program. We also require that \( PROG (G)\) is compatible with our constructions. We will show our constructions to be so benign in terms of infinite loops that this is not too strong of an assumption.

*G*, we define

*i*, we define the (set-valued) best response function as

*(weak) program equilibrium of*

*G*if for \(i\in \{1,2 \}\) it is \(p_i\in B_i(p_{-i})\).

### 2.3 Repeated games

*G*, we define \(G_{\epsilon }\) to be the repetition of

*G*with a probability of \(\epsilon \in \left( 0,1\right] \) of ending after each round. Both players of \(G_{\epsilon }\) will be informed only of the last move of their opponent. This differs from the more typical assumption that players have access to the entire history of past moves. We will later see why this deviation is necessary. A strategy \(\pi _i\) for player

*i*non-deterministically maps opponent moves or the information of the lack thereof onto a move

*a*given that the opponent played

*b*in the previous round and \(\pi _i(0,a):=\pi _i(0)(a)\) denotes the probability of choosing

*a*in the first round. We call a strategy \(\pi _i\)

*stationary*if for all \(a\in A_i\), \(\pi _i(b,a)\) is constant with respect to \(b\in \{ 0 \} \cup A_{-i}\). If \(\pi _i\) is stationary, we write \(\pi _i(a):=\pi _i(b,a)\). The probability that the game follows a complete history of moves \(h=a_0b_0a_1b_1\cdots a_nb_n\) and then ends is

*i*given the strategy profile \((\pi _1,\pi _2)\) is

*i*, we define the set-valued best response function as

*(weak) Nash equilibrium of*\(G_\epsilon \) if for \(i\in \{1,2 \}\) it is \(\pi _i\in B_i(\pi _{-i})\).

We now prove a few lemmas that we will need later on. First, we have suggestively called the values *P*(*h*) probabilities, but we have not shown them to satisfy, say, Kolmogorov’s axioms. Additivity is not an issue, because we have only defined the probability for atomic events and non-negativity is obvious from the definition. However, we will also need the fact that the numbers we have called probabilities indeed sum to 1, which requires a few lines to prove.

### Lemma 1

### Proof

*k*and that (

*a*,

*b*) is played in that round. With this

*k*th round (where not getting to the

*k*th round counts as 0). This suggests a new way of calculating expected utilities on a more round-by-round basis.

### Lemma 2

### Proof

To find the probability of player *i* choosing *a* in round *k*, we usually have to calculate the probabilities of all actions in all previous rounds. After all, player *i* reacts to player \(-i\)’s previous move, who in turn reacts to player *i*’s move in round \(k-2\), and so on. This is what makes Eq. 7 so long. However, imagine that player \(-i\) uses a stationary strategy. This, of course, means that player \(-i\)’s probability distribution over moves in round *k* (assuming the game indeed reaches round *k*) can be computed directly as \(\pi _{-i}(b)\). Player *i*’s distribution over moves in round *k* is almost as simple to calculate, because it only depends on player \(-i\)’s distribution over moves in round \(k-1\), which can also be calculated directly. We hence get the following lemma.

### Lemma 3

### Proof

*k*. For \(k=1\), it is

*k*, it is also true for \(k+1\):\(\square \)

## 3 Robust program equilibrium in the prisoner’s dilemma

Payoff matrix for the prisoner’s dilemma

Player 1 | Player 2 | |
---|---|---|

Cooperate | Defect | |

Cooperate | 3, 3 | 1, 4 |

Defect | 4, 1 | 2, 2 |

I propose to use the following decision rule: with a probability of \(\epsilon \in \left( 0,1\right] \), cooperate. Otherwise, act as your opponent plays against you. I will call this strategy \(\epsilon \)GroundedFairBot. A description of the algorithm in pseudo-code is given in Algorithm 1.^{3}

The proposed program combines two main ideas. First, it is a version of FairBot (Barasz et al. 2014). That is, it chooses the action that its opponent would play against itself. As player \(-i\) would like player *i* to cooperate, \(\epsilon \)GroundedFairBot thus incentivizes cooperation, as long as \(\epsilon <1/2\). In this, it resembles the tit for tat strategy in the iterated prisoner’s dilemma (IPD) (famously discussed by Axelrod 2006), which takes an empirical approach to behaving as the opponent behaves against itself. Here, the probability of the game ending must be sufficiently small (again, less than 1 / 2 for the given payoffs) in each round for the threat of punishment and the allure of reward to be persuasive reasons to cooperate.

To better understand how \(\epsilon \)GroundedFairBot works, consider its behavior against a few different opponents. When \(\epsilon \)GroundedFairBot faces NaiveFairBot, then both cooperate. For illustration, a dynamic call graph of their interaction is given in Fig. 1. It is left as an exercise for the reader to analyze \(\epsilon \)GroundedFairBot’s behavior against other programs, such as another instance of \(\epsilon \)GroundedFairBot or a variation of \(\epsilon \)GroundedFairBot that defects rather than cooperates with probability \(\epsilon \).

In addition to proving theoretical results (as done below), it would be useful to test \(\epsilon \)GroundedFairBot “in practice”, i.e., against other proposed programs for the prisoner’s dilemma with access to one another’s source code. I only found one informal tournament for this version of the prisoner’s dilemma. It was conducted in 2013 by Alex Mennen on the online forum and community blog LessWrong.^{4} In the original set of submissions, \(\epsilon \)GroundedFairBot would have scored 6th out of 21. The reason why it is not a serious contender for first place is that it does not take advantage of the many exploitable submissions (such as bots that decide without looking at their opponent’s source code). Once one removes the bottom 9 programs, \(\epsilon \)GroundedFairBot scores second place. If one continues this process of eliminating unsuccessful programs for another two rounds, \(\epsilon \)GroundedFairBot ends up among the four survivors that cooperate with each other.^{5}

## 4 From iterated game strategies to robust program equilibria

Again, our proposed program combines two main ideas. First, \(\epsilon \)Grounded\(\pi _i\)Bot responds to how the opponent plays against \(\epsilon \)Grounded\(\pi _i\)Bot. In this, it resembles the behavior of \(\pi _i\) in \(G_\epsilon \). As we will see, this leads \(\epsilon \)Grounded\(\pi _i\)Bot to inherit many of \(\pi _i\)’s properties. In particular, if (like tit for tat) \(\pi _i\) uses some mechanism to incentivize its opponent to converge on a desired action, then \(\epsilon \)Grounded\(\pi _i\)Bot incentivizes that action in a similar way.

We now ground these two intuitions formally. First, we discuss \(\epsilon \)Grounded\(\pi _i\)Bot’s halting behavior. We then show that, in some sense, \(\epsilon \)Grounded\(\pi _i\)Bot behaves in *G* like \(\pi _i\) does in \(G_\epsilon \).

### 4.1 Halting behavior

For a program to be a viable option in the “transparent” version of *G*, it should halt against a wide variety of opponents. Otherwise, it may be excluded from \( PROG (G)\) in our formalism. Besides, it should be efficient enough to be practically useful. As with \(\epsilon \)GroundedFairBot, the main reason why \(\epsilon \)Grounded\(\pi _i\)Bot is benign in terms of the risk of infinite loops is that it generates strictly less than one new function call in expectation and never starts more than one. While we have no formal machinery for analyzing the “loop risk” of a program, it is easy to show the following theorem.

### Theorem 4

### Proof

It suffices to discuss the cases in which \(p_{-i}\) calls \(p_i\) once with certainty, because if any of our claims were refuted by some program \(p_{-i}\), they would also be refuted by a version of that program that calls \(p_i\) once with certainty. If \(p_{-i}\) calls \(p_i\) once with certainty, then the dynamic call graphs of both \(\epsilon \)Grounded\(\pi _i\)Bot and \(p_{-i}\) look similar to the one drawn in 1. In particular, it only contains one infinite path and that path has a probability of at most \(\lim _{i\rightarrow \infty } (1-\epsilon )^i=0\).

Note that this argument would not work if there were more than two players or if the strategy for the iterated game were to depend on more than just the last opponent move, because in these cases, the natural extension of \(\epsilon \)Grounded\(\pi _i\)Bot would have to make multiple calls to other programs. Indeed, this is one of the reasons why the present paper only discusses two-player games and iterated games with such short-term memory. Whether a similar result can nonetheless be obtained for more than 2 players and strategies that depend on the entire past history is left to future research.

As special cases, for any strategy \(\pi _{-i}\), \(\epsilon \)Grounded\(\pi _i\)Bot terminates against \(\epsilon \)Grounded\(\pi _{-i}\)Bot and Naive\(\pi _{-i}\)Bot (and these programs in turn terminate against \(\epsilon \)Grounded\(\pi _i\)Bot). The latter is especially remarkable. Our \(\epsilon \)Grounded\(\pi _i\)Bot terminates and leads the opponent to terminate even if the opponent is so careless that it would not even terminate against a version of itself or, in our formalism, if \( PROG _{-i}(G)\) gives the opponent more leeway to work with simulations.

### 4.2 Relationship to the underlying iterated game strategy

### Theorem 5

*G*be a game, \(\pi _i\) be a strategy for player

*i*in \(G_\epsilon \), \(p_i=\epsilon \mathrm {Grounded} \pi _i \mathrm {Bot}\) and \(p_{-i}\in PROG _{-i}(G)\) be any opponent program. We define \(\pi _{-i}= apply (p_{-i},(p_i,p_{-i}))\), which makes \(\pi _{-i}\) a strategy for player \(-i\) in \(G_\epsilon \). Then

### Proof

Note that the program side of the proof does not involve any “mental \(G_\epsilon \)”. Using Theorem 5, we can easily prove a number of property transfers from \(\pi _i\) to \(\epsilon \)Grounded\(\pi _i\)Bot.

### Corollary 6

*G*be a game. Let \(\pi _i\) be a computable strategy for player

*i*in \(G_\epsilon \) and let \(p_i=\epsilon \mathrm {Grounded} \pi _i \mathrm {Bot}\).

- 1.
If \(p_{-i}\in B_{-i}(p_i)\), then \( apply (p_{-i},(p_i,p_{-i}))\in B_{-i}^{s,c}(\pi _i)\).

- 2.
If \(\pi _{-i}\in B_i^{s,c}(\pi _i)\) and \( apply (p_{-i},(p_i,p_{-i}))=\pi _{-i}\), then \(p_{-i}\in B_{-i}(p_i)\).

### Proof

Both 1. and 2. follow directly from Theorem 5. \(\square \)

Intuitively speaking, Corollary 6 shows that \(\pi _i\) and \(\epsilon \mathrm {Grounded} \pi _i \mathrm {Bot}\) provoke the same best responses. Note that best responses in the program game only correspond to best stationary computable best responses in the repeated game. The computability requirement is due to the fact that programs cannot imitate incomputable best responses. The corresponding strategies for the repeated game further have to be stationary because \(\epsilon \)Grounded\(\pi _i\)Bot only incentivizes opponent behavior for a single situation, namely the situation of playing against \(\epsilon \)Grounded\(\pi _i\)Bot. As a special case of Corollary 6, if \(\epsilon <1/2\), the best response to \(\epsilon \)GroundedFairBot is a program that cooperates against \(\epsilon \)GroundedFairBot because in a IPD with a probability of ending of less than 1 / 2 a program that cooperates is the best (stationary computable) response to tit for tat.

### Corollary 7

Let *G* be a game. If \((\pi _1,\pi _2)\) is a Nash equilibrium of \(G_\epsilon \) and \(\pi _1\) and \(\pi _2\) are computable, then \((\epsilon \mathrm {Grounded} \pi _1 \mathrm {Bot},\epsilon \mathrm {Grounded} \pi _2 \mathrm {Bot})\) is a program equilibrium of *G*.

### Proof

Follows directly from Theorem 5. \(\square \)

### 4.3 Exploitability

Besides forming an equilibrium against many opponents (including itself) and incentivizing cooperation, another important reason for tit for tat’s success is that it is “not very exploitable” (Axelrod 2006). That is, when playing against tit for tat, it is impossible to receive a much higher reward than tit for tat itself. We now show that (in)exploitability transfers from strategies \(\pi _i\) to \(\epsilon \)Grounded\(\pi _i\)Bots.

*symmetric*if \(A_1=A_2\) and \(u_1(a,b)=u_2(b,a)\) for all \(a\in A_1\) and \(b\in A_2\). If

*G*is symmetric, we call a strategy \(\pi _i\) for \(G_\epsilon \)

*N*-exploitable in \(G_\epsilon \) for an \(N\in {\mathbb {R}}_{\ge 0}\) if there exists a \(\pi _{-i}\), such that

*N*-inexploitable if it is not

*N*-exploitable.

*G*we call a program \(p_i\)

*N*-exploitable for an \(N\in {\mathbb {R}}_{\ge 0}\) if there exists a \(p_{-i}\), such that

*N*-inexploitable if it is not

*N*-exploitable.

### Corollary 8

Let *G* be a game and \(\pi _i\) be an *N*-inexploitable strategy for \(G_\epsilon \). Then \(\epsilon \)Grounded\(\pi _i\)Bot is \(\epsilon N\)-inexploitable.

### Proof

Follows directly from Theorem 5. \(\square \)

Notice that if – like tit for tat in the IPD – \(\pi _i\) is *N*-inexploitable in \(G_\epsilon \) for all \(\epsilon \), then we can make \(\epsilon \)Grounded\(\pi _i\)Bot arbitrarily close to 0-inexploitable by decreasing \(\epsilon \).

## 5 Conclusion

- 1.
Construct the game’s corresponding repeated game. In particular, we consider repeated games in which each player can only react to the opponent’s move in the previous round (rather than the entire history of previous moves by both players) and the game ends with some small probability \(\epsilon \) after each round.

- 2.
Construct a Nash equilibrium for that iterated game.

- 3.
Convert each of the strategies into a computer program that works as follows (see Algorithm 3): with probability \(\epsilon \) do what the strategy does in the first round. With probability \(1-\epsilon \), apply the opponent program to this program; then do what the underlying strategy would reply to the opponent program’s output.

## Footnotes

- 1.
- 2.
For keeping our notation simple, we will assume that our programs receive their own source code as input in addition to their opponent’s. If \( PROG _i(G)\) is sufficiently powerful, then by Kleene’s second recursion theorem, programs could also refer to their own source code without receiving it as an input (Cutland 1980, ch. 11).

- 3.
This program was proposed by Abram Demski in a conversation discussing similar (but worse) ideas of mine. It has also been proposed by Jessica Taylor at https://agentfoundations.org/item?id=524, though in a slightly different context.

- 4.
The tournament was announced at https://www.lesserwrong.com/posts/BY8kvyuLzMZJkwTHL/prisoner-s-dilemma-with-visible-source-code-tournament and the results at https://www.lesserwrong.com/posts/QP7Ne4KXKytj4Krkx/prisoner-s-dilemma-tournament-results-0.

- 5.
For a more detailed analysis and report on my methodology, see https://casparoesterheld.files.wordpress.com/2018/02/transparentpdwriteup.pdf.

## Notes

### Acknowledgements

I am indebted to Max Daniel and an anonymous referee for many helpful comments.

## References

- Anderlini, L. (1990). Some notes on Church’s thesis and the theory of games.
*Theory and Decision*,*29*(1), 19–52.CrossRefGoogle Scholar - Axelrod, R. (2006).
*The Evolution of Cooperation*. New York: Basic Books.Google Scholar - Barasz, M., et al. (2014). Robust cooperation in the prisoner’s dilemma: Program equilibrium via provability logic. arxiv: 1401.5577.
- Binmore, K. (1987). Modeling rational players: Part I.
*Economics and Philosophy*,*3*(2), 179–214.CrossRefGoogle Scholar - Binmore, K. (1988). Modeling rational players: Part II.
*Economics and Philosophy*,*4*(1), 9–55.CrossRefGoogle Scholar - Critch, A. (2016). Parametric bounded Löb’s theorem and robust cooperation of bounded agents. arxiv:1602.04184.
- Cutland, N. (1980).
*Computability: An introduction to recursive function theory*. Cambridge: Cambridge University Press.CrossRefGoogle Scholar - van der Hoek, W., Witteveen, C., & Wooldridge, M. (2013). Program equilibrium—A program reasoning approach.
*International Journal of Game Theory*,*42*(3), 639–671.CrossRefGoogle Scholar - Howard, J. V. (1988). Cooperation in the prisoner’s dilemma.
*Theory and Decision*,*24*(3), 203–213.CrossRefGoogle Scholar - McAfee, R. P. (1984). Effective computability in economic decisions. http://www.mcafee.cc/Papers/PDF/EffectiveComputability.pdf. Accessed 10 Nov 2018.
- Osborne, M. J. (2004).
*An introduction to game theory*. Oxford: Oxford University Press.Google Scholar - Tennenholtz, M. (2004). Program equilibrium.
*Games and Economic Behavior*,*49*(2), 363–373.CrossRefGoogle Scholar

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.