Robust program equilibrium

Oesterheld, Caspar

doi:10.1007/s11238-018-9679-3

Robust program equilibrium

Open access
Published: 23 November 2018

Volume 86, pages 143–159, (2019)
Cite this article

Download PDF

You have full access to this open access article

Theory and Decision Aims and scope Submit manuscript

Robust program equilibrium

Download PDF

Caspar Oesterheld ORCID: orcid.org/0000-0003-4222-7855¹^nAff2

5430 Accesses
3 Citations
1 Altmetric
Explore all metrics

Abstract

One approach to achieving cooperation in the one-shot prisoner’s dilemma is Tennenholtz’s (Games Econ Behav 49(2):363–373, 2004) program equilibrium, in which the players of a game submit programs instead of strategies. These programs are then allowed to read each other’s source code to decide which action to take. As shown by Tennenholtz, cooperation is played in an equilibrium of this alternative game. In particular, he proposes that the two players submit the same version of the following program: cooperate if the opponent is an exact copy of this program and defect otherwise. Neither of the two players can benefit from submitting a different program. Unfortunately, this equilibrium is fragile and unlikely to be realized in practice. We thus propose a new, simple program to achieve more robust cooperative program equilibria: cooperate with some small probability $\epsilon $ and otherwise act as the opponent acts against this program. I argue that this program is similar to the tit for tat strategy for the iterated prisoner’s dilemma. Both “start” by cooperating and copy their opponent’s behavior from “the last round”. We then generalize this approach of turning strategies for the repeated version of a game into programs for the one-shot version of a game to other two-player games. We prove that the resulting programs inherit properties of the underlying strategy. This enables them to robustly and effectively elicit the same responses as the underlying strategy for the repeated game.

Quantal response equilibrium for the Prisoner’s Dilemma game in Markov strategies

Article Open access 16 March 2022

Human players manage to extort more than the mutual cooperation payoff in repeated social dilemmas

Article Open access 19 August 2021

Efficiency may improve when defectors exist

Article 29 August 2015

1 Introduction

Much has been written about rationalizing non-Nash equilibrium play in strategic-form games. Most prominently, game theorists have discussed how cooperation may be achieved in the prisoner’s dilemma, where mutual cooperation is not a Nash equilibrium but Pareto-superior to mutual defection. One of the most successful approaches is the repetition of a game, and in particular the iterated prisoner’s dilemma (Axelrod 2006).

Another approach is to introduce commitment mechanisms of some sort. In this paper, we will discuss one particular commitment mechanism: Tennenholtz’s (2004) program equilibrium formalism (Sect. 2.2). Here, the idea is that in place of strategies, players submit programs which compute strategies and are given access to each other’s source code. The programs can then encode credible commitments, such as some version of “if you cooperate, I will cooperate”.

As desired, Tennenholtz (2004, Sect. 3, Theorem 1) shows that mutual cooperation is played in a program equilibrium of the prisoner’s dilemma. However, Tennenholtz’ equilibrium is very fragile. Essentially, it consists of two copies of a program that cooperates if it faces an exact copy of itself (cf. McAfee 1984; Howard 1988). Even small deviations from that program break the equilibrium. Thus, achieving cooperation in this way is only realistic if the players can communicate beforehand and settle on a particular outcome.

Another persuasive critique of this trivial equilibrium is that the model of two players submitting programs is only a metaphor, anyway. In real life, the programs may instead be the result of an evolutionary process (Binmore 1988 pp. 14f.) and Tennenholtz’ equilibrium is a hard to obtain by such a process. Alternatively, if we view our theory as normative rather than descriptive, we may view the programs themselves as the target audience of our recommendations. This also means that these agents will already have some form of source code—e.g., one that derives and considers the program equilibria of the game—and it is out of their realm of power to change that source code to match some common standard. However, they may still decide on some procedure for thinking about this particular problem in such a way that enables cooperation with other rationally pre-programmed agents.

Noting the fragility of Tennenholtz’ proposed equilibrium, it has been proposed to achieve a more robust program equilibrium by letting the programs reason about each other (van der Hoek et al. 2013; Barasz et al. 2014; Critch 2016). For example, Barasz et al. (2014, Sect. 3) propose a program FairBot—variations of which we will see in this paper—that cooperates if Peano arithmetic can prove that the opponent cooperates against FairBot. FairBot cooperates (via Löb’s theorem) more robustly against different versions of itself. These proposals are very elegant and certainly deserve further attention. However, their benefits come at the cost of being computationally expensive.

In this paper, I thus derive a class of program equilibria that I will argue to be more practical. In the case of the prisoner’s dilemma, I propose a program that cooperates with a small probability and otherwise acts as the opponent acts against itself (see Algorithm 1). Doing what the opponent does—à la FairBot—incentivizes cooperation. Cooperating with a small probability allows us to avoid infinite loops that would arise if we merely predicted and copied our opponent’s action (see Algorithm 2). This approach to a robust cooperation program equilibrium in the prisoner’s dilemma is described in Sect. 3.

We then go on to generalize the construction exemplified in the prisoner’s dilemma (see Sect. 4). In particular, we show how strategies for the repeated version of a game can be used to construct good programs for the one-shot version of that game. We show that many of the properties of the underlying strategy of the repeated game carry over to the program for the stage game. We can thus construct “good” programs and program equilibria from “good” strategies and Nash equilibria.

2 Preliminaries

2.1 Strategic-form games

For reference, we begin by introducing some basic terminology and formalism for strategic-form games. For an introduction, see, e.g., Osborne (2004). For reasons that will become apparent later on, we limit our treatment to two-player games.

A two-player strategic game$G=(A_1,A_2,u_1,u_2)$ consists of two countable sets of moves $A_i$ and for both players $i\in \{1,2\}$ a bounded utility function $u_i:A_1\times A_2 \rightarrow {\mathbb {R}}$. A (mixed) strategy for player i is a probability distribution $\pi _i$ over $A_i$.

Given a strategy profile$(\pi _1,\pi _2)$ the probability of an outcome$(a_1,a_2)\in A_1\times A_2$ is

$$\begin{aligned} P(a_1,a_2 \mid \pi _1,\pi _2 ):=\pi _1(a_1)\cdot \pi _2(a_2). \end{aligned}$$

(1)

The expected value for player i given that strategy profile is

$$\begin{aligned} {\mathbb {E}}\left[ u_i \mid \pi _1,\pi _2 \right] :=\sum _{(a_1,a_2)\in A_1\times A_2} P(a_1,a_2 \mid \pi _1,\pi _2 ) \cdot u_i (a_1,a_2). \end{aligned}$$

Note that because the utility function is bounded, the sum converges absolutely, such that the order of the action pairs does not affect the sum’s value.

2.2 Program equilibrium

We now introduce the concept of program equilibrium, first proposed by Tennenholtz (2004). The main idea is to replace strategies with computer programs that are given access to each other’s source code.^{Footnote 1} The programs then give rise to strategies.

For any game G, we first need to define the set of program profiles$ PROG (G)$ consisting of pairs of programs. The ith entry of an element of $ PROG (G)$ must be a program source code $p_i$ that, when interpreted by a function $ apply $, probabilistically map program profiles^{Footnote 2} onto $A_i$.

We require that for any program profile $(p_1,p_2)\in PROG (G)$, both programs halt. Otherwise, the profile would not give rise to a well-defined strategy. Whether $p_i$ halts depends on the program $p_{-i}$, it plays against, where (in accordance with convention in game theory) $-:\{1,2\}\rightarrow \{1,2\}:1\mapsto 2, 2 \mapsto 1$ and we write $-i$ instead of $-(i)$. For example, if $p_{i}$ runs $ apply (p_{-i},(p_i, p_{-i}))$, i.e., simulates the opponent, then that is fine as long as $p_{-i}$ does not also run $apply(p_i,(p_i,p_{-i}))$, which would yield an infinite loop. To avoid this mutual dependence, we will generally require that $ PROG (G)= PROG _1(G)\times PROG _2(G)$, where $ PROG _i(G)$ consists of programs for player i. Methods of doing this while maintaining expressive power include hierarchies of players—e.g., higher indexed players are allowed to simulate lower indexed ones but not vice versa—hierarchies of programs—programs can only call their opponents with simpler programs as input—requiring programs to have a “plan B” if termination can otherwise not be guaranteed, or allowing each player to only start strictly less than one simulation in expectation. These methods may also be combined. In this paper, we do not assume any particular definition of $ PROG (G)$. However, we assume that they can perform arbitrary computations as long as these computations are guaranteed to halt regardless of the output of the parts of the code that do depend on the opponent program. We also require that $ PROG (G)$ is compatible with our constructions. We will show our constructions to be so benign in terms of infinite loops that this is not too strong of an assumption.

Given a program profile $(p_1,p_2)$, we receive a strategy profile $( apply (p_1,(p_1,p_2)), apply (p_2,(p_1,p_2)))$. For any outcome $(a_1,a_2)$ of G, we define

$$\begin{aligned} P(a_1,a_2\mid p_1,p_2) :=P(a_1,a_2\mid apply (p_1,(p_1,p_2)), apply (p_2,(p_1,p_2)))\nonumber \\ \end{aligned}$$

(2)

and for every player $i\in \{1,2\}$, we define

$$\begin{aligned} {\mathbb {E}}\left[ u_i \mid p_1,p_2 \right] :=\sum _{(a_1,a_2)\in A_1\times A_2} P(a_1,a_2 \mid p_1,p_2 ) \cdot u_i (a_1,a_2). \end{aligned}$$

(3)

For player i, we define the (set-valued) best response function as

$$\begin{aligned} B_i(p_{-i})={{\mathrm{arg\,max}}}_{p_i\in PROG _i(G)} {\mathbb {E}}\left[ u_i \mid p_i,p_{-i} \right] . \end{aligned}$$

A program profile $(p_1,p_2)$ is a (weak) program equilibrium ofG if for $i\in \{1,2 \}$ it is $p_i\in B_i(p_{-i})$.

2.3 Repeated games

Our construction will involve strategies for the repeated version of a two-player game. Thus, for any game G, we define $G_{\epsilon }$ to be the repetition of G with a probability of $\epsilon \in \left( 0,1\right] $ of ending after each round. Both players of $G_{\epsilon }$ will be informed only of the last move of their opponent. This differs from the more typical assumption that players have access to the entire history of past moves. We will later see why this deviation is necessary. A strategy $\pi _i$ for player i non-deterministically maps opponent moves or the information of the lack thereof onto a move

$$\begin{aligned} \pi _i :\{ 0 \} \cup A_{-i} \rightsquigarrow A_i. \end{aligned}$$

Thus, for $a\in A_i, b\in A_{-i}$, $\pi _i(b,a):=\pi _i(b)(a)$ denotes the probability of choosing a given that the opponent played b in the previous round and $\pi _i(0,a):=\pi _i(0)(a)$ denotes the probability of choosing a in the first round. We call a strategy $\pi _i$stationary if for all $a\in A_i$, $\pi _i(b,a)$ is constant with respect to $b\in \{ 0 \} \cup A_{-i}$. If $\pi _i$ is stationary, we write $\pi _i(a):=\pi _i(b,a)$. The probability that the game follows a complete history of moves $h=a_0b_0a_1b_1\cdots a_nb_n$ and then ends is

$$\begin{aligned} P(h\mid (\pi _1,\pi _2)) :=\pi _1(0,a_0) \pi _2(0,b_0) \epsilon (1-\epsilon )^n \prod _{i=1}^n \pi _1(b_{i-1},a_i) \pi _2(a_{i-1},b_i). \end{aligned}$$

(4)

Note that the moves in the history always come in pairs $a_ib_i$ which are chosen “simultaneously” in response to $b_{i-1}$ and $a_{i-1}$, respectively. The expected value for player i given the strategy profile $(\pi _1,\pi _2)$ is

$$\begin{aligned} {\mathbb {E}}\left[ u_i \mid \pi _1,\pi _2 \right] :=\sum _{h\in (A_1\cdot A_2)^+} P(h \mid \pi _1,\pi _2 ) \cdot u_i (h), \end{aligned}$$

(5)

where $(A_1\cdot A_2)^+$ is the set of all histories and

$$\begin{aligned} u_i (a_0b_0a_1b_1\cdots a_nb_n) :=\sum _{i=0}^n u_i(a_i,b_i). \end{aligned}$$

(6)

The lax unordered summation in Eq. 5 is, again, unproblematic because of the absolute convergence of the series, which is a direct consequence of the proof of Lemma 1. Note how the organization of the history into pairs of moves allows us to apply the utility function of the stage game in Eq. 6.

For player i, we define the set-valued best response function as

$$\begin{aligned} B_i(\pi _{-i})={{\mathrm{arg\,max}}}_{\pi _i:\{0\}\cup A_{-i}\rightsquigarrow A_i} {\mathbb {E}}\left[ u_i \mid \pi _i,\pi _{-i} \right] . \end{aligned}$$

Analogously, $B_i^c(\pi _{-i})$ is the set of responses to $\pi _{-i}$ that are best among the computable ones, $B_i^s(\pi _{-i})$ the set of responses to $\pi _{-i}$ that are best among the stationary ones, and $B_i^{s,c}(\pi _{-i})$ the set of responses to $\pi _{-i}$ that are best among stationary computable strategies. A strategy profile $(\pi _1,\pi _2)$ is a (weak) Nash equilibrium of$G_\epsilon $ if for $i\in \{1,2 \}$ it is $\pi _i\in B_i(\pi _{-i})$.

We now prove a few lemmas that we will need later on. First, we have suggestively called the values P(h) probabilities, but we have not shown them to satisfy, say, Kolmogorov’s axioms. Additivity is not an issue, because we have only defined the probability for atomic events and non-negativity is obvious from the definition. However, we will also need the fact that the numbers we have called probabilities indeed sum to 1, which requires a few lines to prove.

Lemma 1

Let $G_\epsilon $ be a repeated game and $\pi _1,\pi _2$ be strategies for that game. Then

$$\begin{aligned} \sum _{h\in (A_1\cdot A_2)^+} P(h \mid \pi _1,\pi _2 ) = 1. \end{aligned}$$

Proof

For seeing why the second-to-last equation is true, notice that the inner-most sum is 1. Thus, the next sum is 1 as well, and so on. Since the ordering in the right-hand side of the first line is lax, and because only the second line is known to converge absolutely, the re-ordering is best understood from right to left. The last step uses the well-known formula $\sum _{k=0}^\infty x^k=1/(1-x)$ for the geometric series. $\square $

For any game $G_\epsilon $, $k\in {\mathbb {N}}_{+}$, $a\in A_1$, $b\in A_2$ and strategies $\pi _1$ and $\pi _2$ for $G_\epsilon $, we define

(7)

For $k=0$, we define

$$\begin{aligned} P_{0, G_\epsilon } (a,b\mid \pi _1,\pi _2) :=\pi _1(0,a)\cdot \pi _2(0,b). \end{aligned}$$

Intuitively speaking, $P_{k, G_\epsilon } (a,b\mid \pi _1,\pi _2)$ is the probability of reaching at least round k and that (a, b) is played in that round. With this

$$\begin{aligned} \sum _{ab\in A_1A_2} P_{k, G_\epsilon } (a,b\mid \pi _1,\pi _2) u_i(a,b) \end{aligned}$$

should be the expected utility from the kth round (where not getting to the kth round counts as 0). This suggests a new way of calculating expected utilities on a more round-by-round basis.

Lemma 2

Let $G_\epsilon $ be a game, and let $\pi _1,\pi _2$ be strategies for that game. Then

$$\begin{aligned} {\mathbb {E}}\left[ u_i \mid \pi _1,\pi _2 \right] = \sum _{k=0}^\infty \sum _{ab\in A_1A_2} P_{k, G_\epsilon } (a,b\mid \pi _1,\pi _2) u_i(a,b). \end{aligned}$$

Proof

$$\begin{aligned}&{\mathbb {E}}_{G_\epsilon } \left[ u_i \mid \pi _1, \pi _{2}\right] = \sum _{h\in (A_1A_2)^+} P(h\mid \pi _1,\pi _{2})u_i(h)\\&\quad \underset{\text {Eqs. }4,~6}{=} \sum _{a_0b_0\cdots a_nb_n\in (A_1A_2)^+} \pi _1(0,a_0) \pi _2(0,b_0) \epsilon (1-\epsilon )^n \left( \prod _{j=1}^n \pi _1(b_{j-1},a_j) \pi _{2}(a_{j-1},b_j) \right) \\&\qquad \quad \quad \cdot \sum _{k=0}^n u_i(a_k,b_k) = \sum _{k=0}^\infty \sum _{a_0b_0\cdots a_nb_n\in (A_1A_2)^{\ge k+1}} \pi _1(0,a_0) \pi _{2}(0,b_0) \epsilon (1-\epsilon )^n \\&\qquad \quad \quad \cdot \left( \prod _{j=1}^n \pi _1(b_{j-1},a_j) \pi _2(a_{j-1},b_j)\right) u_i(a_k,b_k)\\&\quad \quad = \sum _{k=0}^\infty \sum _{a_0b_0\cdots a_kb_k\in (A_1A_2)^{k+1}} \sum _{a_{k+1}b_{k+1}\cdots a_nb_n\in (A_1A_2)^*} \pi _1(0,a_0) \pi _2(0,b_0) \epsilon (1-\epsilon )^n u_i(a_k,b_k)\\&\qquad \qquad \cdot \left( \prod _{j=1}^{k} \pi _1(b_{j-1},a_j) \pi _2(a_{j-1},b_j) \right) \left( \prod _{j=k+1}^{n} \pi _1(b_{j-1},a_j) \pi _2(a_{j-1},b_j) \right) \\&\quad \quad = \sum _{k=0}^\infty \sum _{a_0b_0\cdots a_kb_k\in (A_1A_2)^{ k+1}} \pi _1(0,a_0) \pi _2(0,b_0) \epsilon (1-\epsilon )^k u_i(a_k,b_k)\\&\qquad \qquad \cdot \left( \prod _{j=1}^{k} \pi _1(b_{j-1},a_j) \pi _2(a_{j-1},b_j) \right) \\&\qquad \qquad \cdot \sum _{a_{k+1}b_{k+1}\cdots a_nb_n\in (A_1A_2)^*} \epsilon (1-\epsilon )^{n-k} \left( \prod _{j=k+1}^{n} \pi _1(b_{j-1},a_j) \pi _2(a_{j-1},b_j) \right) \\&\quad \underset{\text {lemma }1}{=} \sum _{k=0}^\infty \sum _{a_0b_0\cdots a_kb_k\in (A_1A_2)^{ k+1}} \pi _1(0,a_0) \pi _2(0,b_0) \epsilon (1-\epsilon )^k u_i(a_k,b_k)\\&\qquad \qquad \cdot \left( \prod _{j=1}^{k} \pi _1(b_{j-1},a_j) \pi _2(a_{j-1},b_j) \right) \\&\quad \underset{\text {Eq. }7}{=} \sum _{k=0}^\infty \sum _{a_kb_k\in A_1A_2} P_k(a_k,b_k \mid \pi _1,\pi _2) u_i(a_k,b_k) \end{aligned}$$

$\square $

To find the probability of player i choosing a in round k, we usually have to calculate the probabilities of all actions in all previous rounds. After all, player i reacts to player $-i$’s previous move, who in turn reacts to player i’s move in round $k-2$, and so on. This is what makes Eq. 7 so long. However, imagine that player $-i$ uses a stationary strategy. This, of course, means that player $-i$’s probability distribution over moves in round k (assuming the game indeed reaches round k) can be computed directly as $\pi _{-i}(b)$. Player i’s distribution over moves in round k is almost as simple to calculate, because it only depends on player $-i$’s distribution over moves in round $k-1$, which can also be calculated directly. We hence get the following lemma.

Lemma 3

Let $G_\epsilon $ be a game, let $\pi _i$ be a any strategy for $G_\epsilon $ and let $\pi _{-i}$ be a stationary strategy for $G_\epsilon $. Then, for all $k\in {\mathbb {N}}_+$, it is

$$\begin{aligned} P_{k, G_\epsilon } (a,b\mid \pi _i,\pi _{-i}) = (1-\epsilon )^k \sum _{b'\in A_{-i}} \pi _{-i}(b') \pi _{-i}(b) \pi _{i} ( b',a). \end{aligned}$$

Proof

We conduct our proof by induction over k. For $k=1$, it is

$$\begin{aligned}&P_1(a,b \mid \pi _i, \pi _{-i}) \underset{\text {Eq. }7}{=} (1-\epsilon ) \sum _{a_0b_0} \pi _i(b_0,a)\pi _{-i}(a_0,b) \pi _i(0,a_0) \pi _{-i}(0,b_0)\\&\quad = (1-\epsilon )\sum _{b_0} \pi _i(b_0,a) \pi _{-i}(b) \pi _{-i}(b_0) \sum _{a_0}\pi _i(0,a_0)\\&\quad = (1-\epsilon )\sum _{b_0} \pi _i(b_0,a) \pi _{-i}(b) \pi _{-i}(b_0). \end{aligned}$$

If the lemma is true for k, it is also true for $k+1$:

$\square $

3 Robust program equilibrium in the prisoner’s dilemma

Discussions of the program equilibrium have traditionally used the well-known prisoner’s dilemma (or trivial variations thereof) as an example to show how the program equilibrium rationalizes cooperation where the Nash equilibrium fails (e.g., Tennenholtz 2004, Sect. 3; McAfee 1984; Howard 1988; Barasz et al. 2014). The present paper is no exception. In this section, we will present our main idea using the example of the prisoner’s dilemma; the next section gives the more general construction and proofs of properties of that construction. For reference, the payoff matrix of the prisoner’s dilemma is given in Table 1.

Table 1 Payoff matrix for the prisoner’s dilemma

Full size table

I propose to use the following decision rule: with a probability of $\epsilon \in \left( 0,1\right] $, cooperate. Otherwise, act as your opponent plays against you. I will call this strategy $\epsilon $GroundedFairBot. A description of the algorithm in pseudo-code is given in Algorithm 1.^{Footnote 3}

The proposed program combines two main ideas. First, it is a version of FairBot (Barasz et al. 2014). That is, it chooses the action that its opponent would play against itself. As player $-i$ would like player i to cooperate, $\epsilon $GroundedFairBot thus incentivizes cooperation, as long as $\epsilon <1/2$. In this, it resembles the tit for tat strategy in the iterated prisoner’s dilemma (IPD) (famously discussed by Axelrod 2006), which takes an empirical approach to behaving as the opponent behaves against itself. Here, the probability of the game ending must be sufficiently small (again, less than 1 / 2 for the given payoffs) in each round for the threat of punishment and the allure of reward to be persuasive reasons to cooperate.

The second main idea behind $\epsilon $GroundedFairBot is that it cooperates with some small probability $\epsilon $. First and foremost, this avoids running into the infinite loop that a naive implementation of FairBot—see Algorithm 2— runs into when playing against opponents who, in turn, try to simulate FairBot. Note, again, the resemblance with the tit for tat strategy in the iterated prisoner’s dilemma, which cooperates when no information about the opponent’s strategy is available.

To better understand how $\epsilon $GroundedFairBot works, consider its behavior against a few different opponents. When $\epsilon $GroundedFairBot faces NaiveFairBot, then both cooperate. For illustration, a dynamic call graph of their interaction is given in Fig. 1. It is left as an exercise for the reader to analyze $\epsilon $GroundedFairBot’s behavior against other programs, such as another instance of $\epsilon $GroundedFairBot or a variation of $\epsilon $GroundedFairBot that defects rather than cooperates with probability $\epsilon $.

When playing against strategies that are also based on simulating their opponent, we could think of $\epsilon $GroundedFairBot as playing a “mental IPD”. If the opponent program decides whether to cooperate, it has to consider that it might currently only be simulated. Thus, it will choose an action with an eye toward gauging a favorable reaction from $\epsilon $GroundedFairBot one recursion level up. Cooperation in the first “round” is an attempt to steer the mental IPD into a favorable direction, at the cost of cooperating if $ sample \left( 0,1\right) <\epsilon $ already occurs in the first round.

In addition to proving theoretical results (as done below), it would be useful to test $\epsilon $GroundedFairBot “in practice”, i.e., against other proposed programs for the prisoner’s dilemma with access to one another’s source code. I only found one informal tournament for this version of the prisoner’s dilemma. It was conducted in 2013 by Alex Mennen on the online forum and community blog LessWrong.^{Footnote 4} In the original set of submissions, $\epsilon $GroundedFairBot would have scored 6th out of 21. The reason why it is not a serious contender for first place is that it does not take advantage of the many exploitable submissions (such as bots that decide without looking at their opponent’s source code). Once one removes the bottom 9 programs, $\epsilon $GroundedFairBot scores second place. If one continues this process of eliminating unsuccessful programs for another two rounds, $\epsilon $GroundedFairBot ends up among the four survivors that cooperate with each other.^{Footnote 5}

4 From iterated game strategies to robust program equilibria

We now generalize the construction from the previous section. Given any computable strategy $\pi _i$ for a sequential game $G_\epsilon $, I propose the following program: with a small probability $\epsilon $ sample from $\pi _i(0)$. Otherwise, act how $\pi _i$ would respond (in the sequential game $G_\epsilon $) to the action that the opponent takes against this program. I will call this program $\epsilon $Grounded$\pi _i$Bot. A description of the program in pseudo-code is given in Algorithm 3. As a special case, $\epsilon $GroundedFairBot arises from $\epsilon $Grounded$\pi _i$Bot by letting $\pi _i$ be tit for tat.

Again, our proposed program combines two main ideas. First, $\epsilon $Grounded$\pi _i$Bot responds to how the opponent plays against $\epsilon $Grounded$\pi _i$Bot. In this, it resembles the behavior of $\pi _i$ in $G_\epsilon $. As we will see, this leads $\epsilon $Grounded$\pi _i$Bot to inherit many of $\pi _i$’s properties. In particular, if (like tit for tat) $\pi _i$ uses some mechanism to incentivize its opponent to converge on a desired action, then $\epsilon $Grounded$\pi _i$Bot incentivizes that action in a similar way.

Second, it—again—terminates immediately with some small probability $\epsilon $ to avoid the infinite loops that Naive$\pi _i$Bot—see Algorithm 4—runs into. Playing $\pi _i(0)$ in particular is partly motivated by the “mental $G_\epsilon $” that $\epsilon $Grounded$\pi _i$Bot plays against some opponents (such as $\epsilon $Grounded$\pi _{-i}$Bots or Naive$\pi _{-i}$Bots). The other motivation is to make the relationship between $\epsilon $Grounded$\pi _i$Bot and $\pi _i$ cleaner. In terms of the strategies that are optimal against $\epsilon $Grounded$\pi _i$Bot, the choice of that constant action cannot matter much if $\epsilon $ is small. Consider, again, the analogy with tit for tat. Even if tit for tat started with defection, one should still attempt to cooperate with it. In practice, however, it has turned out that the “nice” version of tit for tat (and nice strategies in general) are more successful (Axelrod 2006, ch. 2). The transparency in the program equilibrium may render such “signals of cooperativeness” less important—e.g., against programs like Barasz et al.’s (2014, Sect. 3) FairBot. Nevertheless, it seems plausible that—if only because of mental $G_\epsilon $-related considerations—in transparent games the “initial” actions matter as well.

We now ground these two intuitions formally. First, we discuss $\epsilon $Grounded$\pi _i$Bot’s halting behavior. We then show that, in some sense, $\epsilon $Grounded$\pi _i$Bot behaves in G like $\pi _i$ does in $G_\epsilon $.

4.1 Halting behavior

For a program to be a viable option in the “transparent” version of G, it should halt against a wide variety of opponents. Otherwise, it may be excluded from $ PROG (G)$ in our formalism. Besides, it should be efficient enough to be practically useful. As with $\epsilon $GroundedFairBot, the main reason why $\epsilon $Grounded$\pi _i$Bot is benign in terms of the risk of infinite loops is that it generates strictly less than one new function call in expectation and never starts more than one. While we have no formal machinery for analyzing the “loop risk” of a program, it is easy to show the following theorem.

Theorem 4

Let $\pi _i$ be a computable strategy for a game $G_\epsilon $. Furthermore, let $p_{-i}$ be any program (not necessarily in $ PROG _i(G)$) that calls $ apply (p_i,(p_i,p_{-i}))$ at most once and halts with probability 1 if $ apply $ halts with probability 1. Then $\epsilon $Grounded$\pi _i$Bot and $p_{-i}$ halt against each other with probability 1 and the expected number of steps required for executing $\epsilon $Grounded$\pi _i$Bot is at most

$$\begin{aligned} T_{\pi _i} + (T_{\pi _i}+T_{p_{-i}}) \frac{1-\epsilon }{\epsilon }, \end{aligned}$$

(8)

where $T_{\pi _i}$ is the maximum number of steps to sample from $\pi _i$, and $T_{p_{-i}}$ is the maximum number of steps needed to sample from $apply(p_{-i},(\epsilon \mathrm {Grounded}\pi _i\mathrm {Bot},p_{-i}))$ excluding the steps needed to execute $ apply (\epsilon \mathrm {Grounded}\pi _i\mathrm {Bot},(\epsilon \mathrm {Grounded}\pi _i\mathrm {Bot}, p_{-i}))$.

Proof

It suffices to discuss the cases in which $p_{-i}$ calls $p_i$ once with certainty, because if any of our claims were refuted by some program $p_{-i}$, they would also be refuted by a version of that program that calls $p_i$ once with certainty. If $p_{-i}$ calls $p_i$ once with certainty, then the dynamic call graphs of both $\epsilon $Grounded$\pi _i$Bot and $p_{-i}$ look similar to the one drawn in 1. In particular, it only contains one infinite path and that path has a probability of at most $\lim _{i\rightarrow \infty } (1-\epsilon )^i=0$.

For the time complexity, we can consider the dynamic call graph as well. The policy $\pi _i$ has to be executed at least once (with probability $\epsilon $ with the input 0 and with probability $1-\epsilon $ against an action sampled from $apply(p_{-i},(\epsilon \mathrm {Grounded}\pi _i\mathrm {Bot},p_{-i})$). With a probability of $(1-\epsilon )$, we also have to execute the non-simulation part of $p_{-i}$ and, for a second time, $\pi _i$. And so forth. The expected number of steps to execute $\epsilon \mathrm {Grounded}\pi _i\mathrm {Bot}$ is thus

$$\begin{aligned} T_{\pi _i} + \sum _{j=1}^\infty (1-\epsilon )^j (T_{\pi _i} +T_{p_{-i}}), \end{aligned}$$

which can be shown to be equal to the term in 8 by using the well-known formula $\sum _{k=0}^\infty x^k=1/(1-x)$ for the geometric series. $\square $

Note that this argument would not work if there were more than two players or if the strategy for the iterated game were to depend on more than just the last opponent move, because in these cases, the natural extension of $\epsilon $Grounded$\pi _i$Bot would have to make multiple calls to other programs. Indeed, this is one of the reasons why the present paper only discusses two-player games and iterated games with such short-term memory. Whether a similar result can nonetheless be obtained for more than 2 players and strategies that depend on the entire past history is left to future research.

As special cases, for any strategy $\pi _{-i}$, $\epsilon $Grounded$\pi _i$Bot terminates against $\epsilon $Grounded$\pi _{-i}$Bot and Naive$\pi _{-i}$Bot (and these programs in turn terminate against $\epsilon $Grounded$\pi _i$Bot). The latter is especially remarkable. Our $\epsilon $Grounded$\pi _i$Bot terminates and leads the opponent to terminate even if the opponent is so careless that it would not even terminate against a version of itself or, in our formalism, if $ PROG _{-i}(G)$ gives the opponent more leeway to work with simulations.

4.2 Relationship to the underlying iterated game strategy

Theorem 5

Let G be a game, $\pi _i$ be a strategy for player i in $G_\epsilon $, $p_i=\epsilon \mathrm {Grounded} \pi _i \mathrm {Bot}$ and $p_{-i}\in PROG _{-i}(G)$ be any opponent program. We define $\pi _{-i}= apply (p_{-i},(p_i,p_{-i}))$, which makes $\pi _{-i}$ a strategy for player $-i$ in $G_\epsilon $. Then

$$\begin{aligned} {\mathbb {E}}_{G}\left[ u_i\mid p_i, p_{-i}\right] = \epsilon {\mathbb {E}}_{G_\epsilon }\left[ u_i\mid \pi _i, \pi _{-i}\right] . \end{aligned}$$

Proof

We separately transform the two expected values in the equation that is to be proven and then notice that they only differ by a factor $\epsilon $:

$$\begin{aligned}&{\mathbb {E}}_{G_{\epsilon }} \left[ u_i\mid \pi _i, \pi _{-i} \right] \underset{\text {lemma }2}{=} \sum _{k=0}^\infty \sum _{ab\in A_iA_{-i}} P_{k,G_\epsilon } (a,b\mid \pi _i,\pi _{-i})u_i(a,b)\\&\;\,\quad \underset{\text {lemma }3}{=} \sum _{ab\in A_iA_{-i}} \pi _i(0,a)\pi _{-i}(b)u_i(a,b) \\&\qquad \qquad + \sum _{k=1}^\infty \sum _{ab\in A_iA_{-i}} (1-\epsilon )^k \sum _{b'\in A_{-i}} \pi _{-i}(b') \pi _i(b',a) \pi _{-i} (b) u_i(a,b)\\&\quad \underset{\begin{array}{c} \text {absolute}\\ \text {convergence} \end{array}}{=} \sum _{ab\in A_iA_{-i}} \pi _i(0,a)\pi _{-i}(b)u_i(a,b) \\&\qquad \qquad + \sum _{ab\in A_iA_{-i}} \sum _{b'\in A_{-i}} \pi _{-i}(b') \pi _i(b',a) \pi _{-i} (b) u_i(a,b) \sum _{k=1}^\infty (1-\epsilon )^k \\&\quad \underset{\begin{array}{c} \text {sum of geo-}\\ \text {metric series} \end{array}}{=} \sum _{ab\in A_iA_{-i}} \pi _i(0,a)\pi _{-i}(b)u_i(a,b)\\&\qquad \qquad + \frac{1-\epsilon }{\epsilon } \sum _{ab\in A_iA_{-i}} \sum _{b'\in A_{-i}} \pi _{-i}(b') \pi _i(b',a) \pi _{-i}(b) u_i(a,b). \end{aligned}$$

The second-to-last step uses absolute convergence to reorder the sum signs. The last step uses the well-known formula $\sum _{k=0}^\infty x^k=1/(1-x)$ for the geometric series.

Onto the other expected value:

$$\begin{aligned}&{\mathbb {E}}_{G}\left[ u_i\mid p_i, p_{-i}\right] \nonumber \\&\quad \underset{\text {Eqs. }3,~~2,~~1}{=} \sum _{ab\in A_iA_{-i}} apply (p_i, (p_i,p_{-i}),a) apply (p_{-i}, (p_i,p_{-i}), b) u_i(a,b)\\&\quad \underset{\text {def.s }p_{i},~\pi _{-i}}{=}\sum _{ab\in A_iA_{-i}} \left( \epsilon \pi _i(0,a) + (1-\epsilon ) \sum _{b'\in A_{-i}} \pi _{-i}(b') \pi _i(b',a)\right) \pi _{-i}(b) u_i(a,b)\\&\quad \qquad = \epsilon \sum _{ab\in A_iA_{-i}} \pi _i(0,a) \pi _{-i}(b) u_i(a,b) \\&\qquad \qquad + (1-\epsilon ) \sum _{ab\in A_iA_{-i}}\sum _{b'\in A_{-i}} \pi _{-i}(b') \pi _i(b',a) \pi _{-i}(b)u_i(a,b). \end{aligned}$$

Here, $ apply (p_i, (p_i,p_{-i}),a):= apply (p_i, (p_i,p_{-i}))(a)$. The hypothesis follows immediately. $\square $

Note that the program side of the proof does not involve any “mental $G_\epsilon $”. Using Theorem 5, we can easily prove a number of property transfers from $\pi _i$ to $\epsilon $Grounded$\pi _i$Bot.

Corollary 6

Let G be a game. Let $\pi _i$ be a computable strategy for player i in $G_\epsilon $ and let $p_i=\epsilon \mathrm {Grounded} \pi _i \mathrm {Bot}$.

1.
If $p_{-i}\in B_{-i}(p_i)$, then $ apply (p_{-i},(p_i,p_{-i}))\in B_{-i}^{s,c}(\pi _i)$.
2.
If $\pi _{-i}\in B_i^{s,c}(\pi _i)$ and $ apply (p_{-i},(p_i,p_{-i}))=\pi _{-i}$, then $p_{-i}\in B_{-i}(p_i)$.

Proof

Both 1. and 2. follow directly from Theorem 5. $\square $

Intuitively speaking, Corollary 6 shows that $\pi _i$ and $\epsilon \mathrm {Grounded} \pi _i \mathrm {Bot}$ provoke the same best responses. Note that best responses in the program game only correspond to best stationary computable best responses in the repeated game. The computability requirement is due to the fact that programs cannot imitate incomputable best responses. The corresponding strategies for the repeated game further have to be stationary because $\epsilon $Grounded$\pi _i$Bot only incentivizes opponent behavior for a single situation, namely the situation of playing against $\epsilon $Grounded$\pi _i$Bot. As a special case of Corollary 6, if $\epsilon <1/2$, the best response to $\epsilon $GroundedFairBot is a program that cooperates against $\epsilon $GroundedFairBot because in a IPD with a probability of ending of less than 1 / 2 a program that cooperates is the best (stationary computable) response to tit for tat.

Corollary 7

Let G be a game. If $(\pi _1,\pi _2)$ is a Nash equilibrium of $G_\epsilon $ and $\pi _1$ and $\pi _2$ are computable, then $(\epsilon \mathrm {Grounded} \pi _1 \mathrm {Bot},\epsilon \mathrm {Grounded} \pi _2 \mathrm {Bot})$ is a program equilibrium of G.

Proof

Follows directly from Theorem 5. $\square $

4.3 Exploitability

Besides forming an equilibrium against many opponents (including itself) and incentivizing cooperation, another important reason for tit for tat’s success is that it is “not very exploitable” (Axelrod 2006). That is, when playing against tit for tat, it is impossible to receive a much higher reward than tit for tat itself. We now show that (in)exploitability transfers from strategies $\pi _i$ to $\epsilon $Grounded$\pi _i$Bots.

We call a game $G=(A_1,A_2,u_1,u_2)$symmetric if $A_1=A_2$ and $u_1(a,b)=u_2(b,a)$ for all $a\in A_1$ and $b\in A_2$. If G is symmetric, we call a strategy $\pi _i$ for $G_\epsilon $N-exploitable in $G_\epsilon $ for an $N\in {\mathbb {R}}_{\ge 0}$ if there exists a $\pi _{-i}$, such that

$$\begin{aligned} {\mathbb {E}}\left[ u_{-i}\mid \pi _{-i},\pi _{i} \right] > {\mathbb {E}}\left[ u_{i}\mid \pi _{-i},\pi _{i} \right] +N. \end{aligned}$$

We call $\pi _i$N-inexploitable if it is not N-exploitable.

Analogously, in a symmetric game G we call a program $p_i$N-exploitable for an $N\in {\mathbb {R}}_{\ge 0}$ if there exists a $p_{-i}$, such that

$$\begin{aligned} {\mathbb {E}}\left[ u_{-i}\mid p_{-i},p_{i} \right] > {\mathbb {E}}\left[ u_{i}\mid p_{-i},p_{i} \right] +N. \end{aligned}$$

We call $p_i$N-inexploitable if it is not N-exploitable.

Corollary 8

Let G be a game and $\pi _i$ be an N-inexploitable strategy for $G_\epsilon $. Then $\epsilon $Grounded$\pi _i$Bot is $\epsilon N$-inexploitable.

Proof

Follows directly from Theorem 5. $\square $

Notice that if – like tit for tat in the IPD – $\pi _i$ is N-inexploitable in $G_\epsilon $ for all $\epsilon $, then we can make $\epsilon $Grounded$\pi _i$Bot arbitrarily close to 0-inexploitable by decreasing $\epsilon $.

5 Conclusion

In this paper, we gave the following recipe for constructing a program equilibrium for a given two-player game:

1.
Construct the game’s corresponding repeated game. In particular, we consider repeated games in which each player can only react to the opponent’s move in the previous round (rather than the entire history of previous moves by both players) and the game ends with some small probability $\epsilon $ after each round.
2.
Construct a Nash equilibrium for that iterated game.
3.
Convert each of the strategies into a computer program that works as follows (see Algorithm 3): with probability $\epsilon $ do what the strategy does in the first round. With probability $1-\epsilon $, apply the opponent program to this program; then do what the underlying strategy would reply to the opponent program’s output.

The result is a program equilibrium which we have argued is more robust than the equilibria described by Tennenholtz (2004). More generally, we have shown that translating an individual’s strategy for the repeated game into a program for the stage game in the way described in step 3 retains many of the properties of the strategy for the repeated game. Thus, it seems that “good” programs to submit may be derived from “good” strategies for the repeated game.

Notes

The equilibrium in its rudimentary form had already been proposed by McAfee (1984) and Howard (1988). At least the idea of viewing players as programs with access to each other’s source code has also been discussed by, e.g., Binmore (1987, Sect. 5; 1988) and Anderlini (1990).
For keeping our notation simple, we will assume that our programs receive their own source code as input in addition to their opponent’s. If $ PROG _i(G)$ is sufficiently powerful, then by Kleene’s second recursion theorem, programs could also refer to their own source code without receiving it as an input (Cutland 1980, ch. 11).
This program was proposed by Abram Demski in a conversation discussing similar (but worse) ideas of mine. It has also been proposed by Jessica Taylor at https://agentfoundations.org/item?id=524, though in a slightly different context.
The tournament was announced at https://www.lesserwrong.com/posts/BY8kvyuLzMZJkwTHL/prisoner-s-dilemma-with-visible-source-code-tournament and the results at https://www.lesserwrong.com/posts/QP7Ne4KXKytj4Krkx/prisoner-s-dilemma-tournament-results-0.
For a more detailed analysis and report on my methodology, see https://casparoesterheld.files.wordpress.com/2018/02/transparentpdwriteup.pdf.

References

Anderlini, L. (1990). Some notes on Church’s thesis and the theory of games. Theory and Decision, 29(1), 19–52.
Article Google Scholar
Axelrod, R. (2006). The Evolution of Cooperation. New York: Basic Books.
Google Scholar
Barasz, M., et al. (2014). Robust cooperation in the prisoner’s dilemma: Program equilibrium via provability logic. arxiv: 1401.5577.
Binmore, K. (1987). Modeling rational players: Part I. Economics and Philosophy, 3(2), 179–214.
Article Google Scholar
Binmore, K. (1988). Modeling rational players: Part II. Economics and Philosophy, 4(1), 9–55.
Article Google Scholar
Critch, A. (2016). Parametric bounded Löb’s theorem and robust cooperation of bounded agents. arxiv:1602.04184.
Cutland, N. (1980). Computability: An introduction to recursive function theory. Cambridge: Cambridge University Press.
Book Google Scholar
van der Hoek, W., Witteveen, C., & Wooldridge, M. (2013). Program equilibrium—A program reasoning approach. International Journal of Game Theory, 42(3), 639–671.
Article Google Scholar
Howard, J. V. (1988). Cooperation in the prisoner’s dilemma. Theory and Decision, 24(3), 203–213.
Article Google Scholar
McAfee, R. P. (1984). Effective computability in economic decisions. http://www.mcafee.cc/Papers/PDF/EffectiveComputability.pdf. Accessed 10 Nov 2018.
Osborne, M. J. (2004). An introduction to game theory. Oxford: Oxford University Press.
Google Scholar
Tennenholtz, M. (2004). Program equilibrium. Games and Economic Behavior, 49(2), 363–373.
Article Google Scholar

Download references

Acknowledgements

I am indebted to Max Daniel and an anonymous referee for many helpful comments.

Author information

Caspar Oesterheld
Present address: Duke University, Durham, North Carolina, USA

Authors and Affiliations

Foundational Research Institute, Berlin, Germany
Caspar Oesterheld

Authors

Caspar Oesterheld
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Caspar Oesterheld.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Oesterheld, C. Robust program equilibrium. Theory Decis 86, 143–159 (2019). https://doi.org/10.1007/s11238-018-9679-3

Download citation

Published: 23 November 2018
Issue Date: 06 February 2019
DOI: https://doi.org/10.1007/s11238-018-9679-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Robust program equilibrium

Abstract

Similar content being viewed by others

Quantal response equilibrium for the Prisoner’s Dilemma game in Markov strategies

Human players manage to extort more than the mutual cooperation payoff in repeated social dilemmas

Efficiency may improve when defectors exist

1 Introduction

2 Preliminaries

2.1 Strategic-form games

2.2 Program equilibrium

2.3 Repeated games

Lemma 1

Proof

Lemma 2

Proof

Lemma 3

Proof

3 Robust program equilibrium in the prisoner’s dilemma

4 From iterated game strategies to robust program equilibria

4.1 Halting behavior

Theorem 4

Proof

4.2 Relationship to the underlying iterated game strategy

Theorem 5

Proof

Corollary 6

Proof

Corollary 7

Proof

4.3 Exploitability

Corollary 8

Proof

5 Conclusion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation