Robust program equilibrium

One approach to achieving cooperation in the one-shot prisoner’s dilemma is Tennenholtz’s (Games Econ Behav 49(2):363–373, 2004) program equilibrium, in which the players of a game submit programs instead of strategies. These programs are then allowed to read each other’s source code to decide which action to take. As shown by Tennenholtz, cooperation is played in an equilibrium of this alternative game. In particular, he proposes that the two players submit the same version of the following program: cooperate if the opponent is an exact copy of this program and defect otherwise. Neither of the two players can benefit from submitting a different program. Unfortunately, this equilibrium is fragile and unlikely to be realized in practice. We thus propose a new, simple program to achieve more robust cooperative program equilibria: cooperate with some small probability ϵ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\epsilon $$\end{document} and otherwise act as the opponent acts against this program. I argue that this program is similar to the tit for tat strategy for the iterated prisoner’s dilemma. Both “start” by cooperating and copy their opponent’s behavior from “the last round”. We then generalize this approach of turning strategies for the repeated version of a game into programs for the one-shot version of a game to other two-player games. We prove that the resulting programs inherit properties of the underlying strategy. This enables them to robustly and effectively elicit the same responses as the underlying strategy for the repeated game.


Introduction
Much has been written about rationalizing non-Nash equilibrium play in strategicform games.Most prominently, game theorists have discussed how cooperation may be achieved in the prisoner's dilemma, where mutual cooperation is not a Nash equilibrium but Pareto-superior to mutual defection.One of the most successful approaches is the repetition of a game, and in particular the iterated prisoner's dilemma (Axelrod 2006).
Another approach is to introduce commitment mechanisms of some sort.In this paper, we will discuss one particular commitment mechanism: Tennenholtz's (2004) program equilibrium formalism (Sect.2.2).Here, the idea is that in place of strategies, players submit programs which compute strategies and are given access to each other's source code.The programs can then encode credible commitments, such as some version of "if you cooperate, I will cooperate".
As desired, Tennenholtz (2004, Sect. 3, Theorem 1) shows that mutual cooperation is played in a program equilibrium of the prisoner's dilemma.However, Tennenholtz' equilibrium is very fragile.Essentially, it consists of two copies of a program that cooperates if it faces an exact copy of itself (cf. McAfee 1984;Howard 1988).Even small deviations from that program break the equilibrium.Thus, achieving cooperation in this way is only realistic if the players can communicate beforehand and settle on a particular outcome.
Another persuasive critique of this trivial equilibrium is that the model of two players submitting programs is only a metaphor, anyway.In real life, the programs may instead be the result of an evolutionary process (Binmore 1988 pp. 14f.) and Tennenholtz' equilibrium is a hard to obtain by such a process.Alternatively, if we view our theory as normative rather than descriptive, we may view the programs themselves as the target audience of our recommendations.This also means that these agents will already have some form of source code-e.g., one that derives and considers the program equilibria of the game-and it is out of their realm of power to change that source code to match some common standard.However, they may still decide on some procedure for thinking about this particular problem in such a way that enables cooperation with other rationally pre-programmed agents.
Noting the fragility of Tennenholtz' proposed equilibrium, it has been proposed to achieve a more robust program equilibrium by letting the programs reason about each other (van der Hoek et al. 2013;Barasz et al. 2014;Critch 2016).For example, Barasz et al. (2014, Sect.3) propose a program FairBot-variations of which we will see in this paper-that cooperates if Peano arithmetic can prove that the opponent cooperates against FairBot.FairBot cooperates (via Löb's theorem) more robustly against different versions of itself.These proposals are very elegant and certainly deserve further attention.However, their benefits come at the cost of being computationally expensive.
In this paper, I thus derive a class of program equilibria that I will argue to be more practical.In the case of the prisoner's dilemma, I propose a program that cooperates with a small probability and otherwise acts as the opponent acts against itself (see Algorithm 1).Doing what the opponent does-à la FairBot-incentivizes cooperation.Cooperating with a small probability allows us to avoid infinite loops that would arise if we merely predicted and copied our opponent's action (see Algorithm 2).This approach to a robust cooperation program equilibrium in the prisoner's dilemma is described in Sect.3.
We then go on to generalize the construction exemplified in the prisoner's dilemma (see Sect. 4).In particular, we show how strategies for the repeated version of a game can be used to construct good programs for the one-shot version of that game.We show that many of the properties of the underlying strategy of the repeated game carry over to the program for the stage game.We can thus construct "good" programs and program equilibria from "good" strategies and Nash equilibria.

Strategic-form games
For reference, we begin by introducing some basic terminology and formalism for strategic-form games.For an introduction, see, e.g., Osborne (2004).For reasons that will become apparent later on, we limit our treatment to two-player games.
A two-player strategic game G = (A 1 , A 2 , u 1 , u 2 ) consists of two countable sets of moves A i and for both players i ∈ {1, 2} a bounded utility function u (1) The expected value for player i given that strategy profile is Note that because the utility function is bounded, the sum converges absolutely, such that the order of the action pairs does not affect the sum's value.

Program equilibrium
We now introduce the concept of program equilibrium, first proposed by Tennenholtz (2004).The main idea is to replace strategies with computer programs that are given access to each other's source code. 1 The programs then give rise to strategies.
For any game G, we first need to define the set of program profiles PROG(G) consisting of pairs of programs.The ith entry of an element of PROG(G) must be a program source code p i that, when interpreted by a function apply, probabilistically map program profiles2 onto A i .
We require that for any program profile ( p 1 , p 2 ) ∈ PROG(G), both programs halt.Otherwise, the profile would not give rise to a well-defined strategy.Whether p i halts depends on the program p −i , it plays against, where (in accordance with convention in game theory) − : {1, 2} → {1, 2} : 1 → 2, 2 → 1 and we write −i instead of −(i).For example, if p i runs apply( p −i , (p i , p −i )), i.e., simulates the opponent, then that is fine as long as p −i does not also run apply( p i , (p i , p −i )), which would yield an infinite loop.To avoid this mutual dependence, we will generally require that PROG(G) = PROG 1 (G) × PROG 2 (G), where PROG i (G) consists of programs for player i. Methods of doing this while maintaining expressive power include hierarchies of players-e.g., higher indexed players are allowed to simulate lower indexed ones but not vice versa-hierarchies of programs-programs can only call their opponents with simpler programs as input-requiring programs to have a "plan B" if termination can otherwise not be guaranteed, or allowing each player to only start strictly less than one simulation in expectation.These methods may also be combined.In this paper, we do not assume any particular definition of PROG(G).However, we assume that they can perform arbitrary computations as long as these computations are guaranteed to halt regardless of the output of the parts of the code that do depend on the opponent program.We also require that PROG(G) is compatible with our constructions.We will show our constructions to be so benign in terms of infinite loops that this is not too strong of an assumption.

Repeated games
Our construction will involve strategies for the repeated version of a two-player game.Thus, for any game G, we define G to be the repetition of G with a probability of ∈ (0, 1] of ending after each round.Both players of G will be informed only of the last move of their opponent.This differs from the more typical assumption that players have access to the entire history of past moves.We will later see why this deviation is necessary.A strategy π i for player i non-deterministically maps opponent moves or the information of the lack thereof onto a move denotes the probability of choosing a given that the opponent played b in the previous round and denotes the probability of choosing a in the first round.We call a strategy a).The probability that the game follows a complete history of moves h = a 0 b 0 a 1 b 1 • • • a n b n and then ends is Note that the moves in the history always come in pairs a i b i which are chosen "simultaneously" in response to b i−1 and a i−1 , respectively.The expected value for player i given the strategy profile where The lax unordered summation in Eq. 5 is, again, unproblematic because of the absolute convergence of the series, which is a direct consequence of the proof of Lemma 1.Note how the organization of the history into pairs of moves allows us to apply the utility function of the stage game in Eq. 6.
For player i, we define the set-valued best response function as Analogously, B c i (π −i ) is the set of responses to π −i that are best among the computable ones, B s i (π −i ) the set of responses to π −i that are best among the stationary ones, and B s,c i (π −i ) the set of responses to π −i that are best among stationary computable strategies.A strategy profile We now prove a few lemmas that we will need later on.First, we have suggestively called the values P(h) probabilities, but we have not shown them to satisfy, say, Kolmogorov's axioms.Additivity is not an issue, because we have only defined the probability for atomic events and non-negativity is obvious from the definition.However, we will also need the fact that the numbers we have called probabilities indeed sum to 1, which requires a few lines to prove.
Lemma 1 Let G be a repeated game and π 1 , π 2 be strategies for that game.Then For seeing why the second-to-last equation is true, notice that the inner-most sum is 1.Thus, the next sum is 1 as well, and so on.Since the ordering in the right-hand side of the first line is lax, and because only the second line is known to converge absolutely, the re-ordering is best understood from right to left.The last step uses the well-known formula ∞ k=0 x k = 1/(1 − x) for the geometric series.
For any game G , k ∈ N + , a ∈ A 1 , b ∈ A 2 and strategies π 1 and π 2 for G , we define For k = 0, we define is the probability of reaching at least round k and that (a, b) is played in that round.With this should be the expected utility from the kth round (where not getting to the kth round counts as 0).This suggests a new way of calculating expected utilities on a more round-by-round basis.
Lemma 2 Let G be a game, and let π 1 , π 2 be strategies for that game.Then To find the probability of player i choosing a in round k, we usually have to calculate the probabilities of all actions in all previous rounds.After all, player i reacts to player −i's previous move, who in turn reacts to player i's move in round k − 2, and so on.This is what makes Eq. 7 so long.However, imagine that player −i uses a stationary strategy.This, of course, means that player −i's probability distribution over moves in round k (assuming the game indeed reaches round k) can be computed directly as π −i (b).Player i's distribution over moves in round k is almost as simple to calculate, because it only depends on player −i's distribution over moves in round k − 1, which can also be calculated directly.We hence get the following lemma.
Lemma 3 Let G be a game, let π i be a any strategy for G and let π −i be a stationary strategy for G .Then, for all k ∈ N + , it is Proof We conduct our proof by induction over k.For k = 1, it is If the lemma is true for k, it is also true for k + 1:

Robust program equilibrium in the prisoner's dilemma
Discussions of the program equilibrium have traditionally used the well-known prisoner's dilemma (or trivial variations thereof) as an example to show how the program equilibrium rationalizes cooperation where the Nash equilibrium fails (e.g., Tennenholtz 2004, Sect. 3;McAfee 1984;Howard 1988;Barasz et al. 2014).The present paper is no exception.In this section, we will present our main idea using the example of the prisoner's dilemma; the next section gives the more general construction and proofs of properties of that construction.For reference, the payoff matrix of the prisoner's dilemma is given in Table 1.
I propose to use the following decision rule: with a probability of ∈ (0, 1], cooperate.Otherwise, act as your opponent plays against you.I will call this strat- Algorithm 1: The GroundedFairBot for player i.The program makes use of a function sample which samples uniformly from the given interval or probability distribution.It is assumed that is computable.egy GroundedFairBot.A description of the algorithm in pseudo-code is given in Algorithm 1. 3The proposed program combines two main ideas.First, it is a version of FairBot (Barasz et al. 2014).That is, it chooses the action that its opponent would play against itself.As player −i would like player i to cooperate, GroundedFairBot thus incentivizes cooperation, as long as < 1/2.In this, it resembles the tit for tat strategy in the iterated prisoner's dilemma (IPD) (famously discussed by Axelrod 2006), which takes an empirical approach to behaving as the opponent behaves against itself.Here, the probability of the game ending must be sufficiently small (again, less than 1/2 for the given payoffs) in each round for the threat of punishment and the allure of reward to be persuasive reasons to cooperate.
The second main idea behind GroundedFairBot is that it cooperates with some small probability .First and foremost, this avoids running into the infinite loop that a naive implementation of FairBot-see Algorithm 2-runs into when playing against opponents who, in turn, try to simulate FairBot.Note, again, the resemblance with the tit for tat strategy in the iterated prisoner's dilemma, which cooperates when no information about the opponent's strategy is available.
Algorithm 2: The NaiveFairBot for player i To better understand how GroundedFairBot works, consider its behavior against a few different opponents.When GroundedFairBot faces NaiveFairBot, then both cooperate.For illustration, a dynamic call graph of their interaction is given in Fig. 1.It is left as an exercise for the reader to analyze GroundedFairBot's behavior against other programs, such as another instance of GroundedFairBot or a variation of GroundedFairBot that defects rather than cooperates with probability .
When playing against strategies that are also based on simulating their opponent, we could think of GroundedFairBot as playing a "mental IPD".If the opponent program decides whether to cooperate, it has to consider that it might currently only be simu- lated.Thus, it will choose an action with an eye toward gauging a favorable reaction from GroundedFairBot one recursion level up.Cooperation in the first "round" is an attempt to steer the mental IPD into a favorable direction, at the cost of cooperating if sample (0, 1) < already occurs in the first round.
In addition to proving theoretical results (as done below), it would be useful to test GroundedFairBot "in practice", i.e., against other proposed programs for the prisoner's dilemma with access to one another's source code.I only found one informal tournament for this version of the prisoner's dilemma.It was conducted in 2013 by Alex Mennen on the online forum and community blog LessWrong. 4In the original set of submissions, GroundedFairBot would have scored 6th out of 21.The reason why it is not a serious contender for first place is that it does not take advantage of the many exploitable submissions (such as bots that decide without looking at their opponent's source code).Once one removes the bottom 9 programs, GroundedFairBot scores second place.If one continues this process of eliminating unsuccessful programs for another two rounds, GroundedFairBot ends up among the four survivors that cooperate with each other.5

From iterated game strategies to robust program equilibria
We now generalize the construction from the previous section.Given any computable strategy π i for a sequential game G , I propose the following program: with a small probability sample from π i (0).Otherwise, act how π i would respond (in the sequential game G ) to the action that the opponent takes against this program.I will call this program Groundedπ i Bot.A description of the program in pseudo-code is given in Algorithm 3. As a special case, GroundedFairBot arises from Groundedπ i Bot by letting π i be tit for tat.
Algorithm 3: The Groundedπ i Bot for player i.The program makes use of a function sample which samples uniformly from a given interval or a given probability distribution.It is assumed that π i and are computable.
Again, our proposed program combines two main ideas.First, Groundedπ i Bot responds to how the opponent plays against Groundedπ i Bot.In this, it resembles the behavior of π i in G .As we will see, this leads Groundedπ i Bot to inherit many of π i 's properties.In particular, if (like tit for tat) π i uses some mechanism to incentivize its opponent to converge on a desired action, then Groundedπ i Bot incentivizes that action in a similar way.
Second, it-again-terminates immediately with some small probability to avoid the infinite loops that Naiveπ i Bot-see Algorithm 4-runs into.Playing π i (0) in particular is partly motivated by the "mental G " that Groundedπ i Bot plays against some opponents (such as Groundedπ −i Bots or Naiveπ −i Bots).The other motivation is to make the relationship between Groundedπ i Bot and π i cleaner.In terms of the strategies that are optimal against Groundedπ i Bot, the choice of that constant action cannot matter much if is small.Consider, again, the analogy with tit for tat.Even if tit for tat started with defection, one should still attempt to cooperate with it.In practice, however, it has turned out that the "nice" version of tit for tat (and nice strategies in general) are more successful (Axelrod 2006, ch. 2).The transparency in the program equilibrium may render such "signals of cooperativeness" less important-e.g., against programs like Barasz et al.'s (2014, Sect. 3) FairBot.Nevertheless, it seems plausible that-if only because of mental G -related considerations-in transparent games the "initial" actions matter as well.
Algorithm 4: The Naiveπ i Bot for player i.It is assumed that π i is computable.
We now ground these two intuitions formally.First, we discuss Groundedπ i Bot's halting behavior.We then show that, in some sense, Groundedπ i Bot behaves in G like π i does in G .

Halting behavior
For a program to be a viable option in the "transparent" version of G, it should halt against a wide variety of opponents.Otherwise, it may be excluded from PROG(G) in our formalism.Besides, it should be efficient enough to be practically useful.As with GroundedFairBot, the main reason why Groundedπ i Bot is benign in terms of the risk of infinite loops is that it generates strictly less than one new function call in expectation and never starts more than one.While we have no formal machinery for analyzing the "loop risk" of a program, it is easy to show the following theorem.
Theorem 4 Let π i be a computable strategy for a game G .Furthermore, let p −i be any program (not necessarily in PROG i (G)) that calls apply( p i , (p i , p −i )) at most once and halts with probability 1 if apply halts with probability 1.Then Groundedπ i Bot and p −i halt against each other with probability 1 and the expected number of steps required for executing Groundedπ i Bot is at most where T π i is the maximum number of steps to sample from π i , and T p −i is the maximum number of steps needed to sample from apply( p −i , ( Groundedπ i Bot, p −i )) excluding the steps needed to execute apply( Groundedπ i Bot, ( Groundedπ i Bot, p −i )).
Proof It suffices to discuss the cases in which p −i calls p i once with certainty, because if any of our claims were refuted by some program p −i , they would also be refuted by a version of that program that calls p i once with certainty.If p −i calls p i once with certainty, then the dynamic call graphs of both Groundedπ i Bot and p −i look similar to the one drawn in 1.In particular, it only contains one infinite path and that path has a probability of at most lim i→∞ (1 − ) i = 0.
For the time complexity, we can consider the dynamic call graph as well.The policy π i has to be executed at least once (with probability with the input 0 and with probability 1− against an action sampled from apply( p −i , ( Groundedπ i Bot, p −i )).With a probability of (1 − ), we also have to execute the non-simulation part of p −i and, for a second time, π i .And so forth.The expected number of steps to execute Groundedπ i Bot is thus which can be shown to be equal to the term in 8 by using the well-known formula ∞ k=0 x k = 1/(1 − x) for the geometric series.

123
Note that this argument would not work if there were than two players or if the strategy for the iterated game were to depend on more than just the last opponent move, because in these cases, the natural extension of Groundedπ i Bot would have to make multiple calls to other programs.Indeed, this is one of the reasons why the present paper only discusses two-player games and iterated games with such shortterm memory.Whether a similar result can nonetheless be obtained for more than 2 players and strategies that depend on the entire past history is left to future research.
As special cases, for any strategy π −i , Groundedπ i Bot terminates against Groundedπ −i Bot and Naiveπ −i Bot (and these programs in turn terminate against Groundedπ i Bot).The latter is especially remarkable.Our Groundedπ i Bot terminates and leads the opponent to terminate even if the opponent is so careless that it would not even terminate against a version of itself or, in our formalism, if PROG −i (G) gives the opponent more leeway to work with simulations.

Relationship to the underlying iterated game strategy
Theorem 5 Let G be a game, π i be a strategy for player i in G , p i = Groundedπ i Bot and p −i ∈ PROG −i (G) be any opponent program.We define Proof We separately transform the two expected values in the equation that is to be proven and then notice that they only differ by a factor : The second-to-last step uses convergence to reorder the sum signs.The last step uses the well-known formula ∞ k=0 x k = 1/(1 − x) for the geometric series.Onto the other expected value: Here, apply( p i , (p i , p −i ), a) := apply( p i , (p i , p −i ))(a).The hypothesis follows immediately.
Note that the program side of the proof does not involve any "mental G ". Using Theorem 5, we can easily prove a number of property transfers from π i to Groundedπ i Bot.
Corollary 6 Let G be a game.Let π i be a computable strategy for player i in G and let p i = Groundedπ i Bot.

If p
Proof Both 1. and 2. follow directly from Theorem 5.
Intuitively speaking, Corollary 6 shows that π i and Groundedπ i Bot provoke the same best responses.Note that best responses in the program game only correspond to best stationary computable best responses in the repeated game.The computability requirement is due to the fact that programs cannot imitate incomputable best responses.The corresponding strategies for the repeated game further have to be stationary because Groundedπ i Bot only incentivizes opponent behavior for a single situation, namely the situation of playing against Groundedπ i Bot.As a special case of Corollary 6, if < 1/2, the best response to GroundedFairBot is a program that cooperates against GroundedFairBot because in a IPD with a probability of ending of less than 1/2 a program that cooperates is the best (stationary computable) response to tit for tat.
Corollary 7 Let G be a game.If (π 1 , π 2 ) is a Nash equilibrium of G and π 1 and π 2 are computable, then ( Groundedπ 1 Bot, Groundedπ 2 Bot) is a program equilibrium of G.
Proof Follows directly from Theorem 5.

Exploitability
Besides forming an equilibrium against many opponents itself) incentivizing cooperation, another important reason for tit for tat's success is that it is "not very exploitable" (Axelrod 2006).That is, when playing against tit for tat, it is impossible to receive a much higher reward than tit for tat itself.We now show that (in)exploitability transfers from strategies π i to Groundedπ i Bots.
We call a game G = (A 1 , A 2 , u 1 , u 2 ) symmetric if A 1 = A 2 and u 1 (a, b) = u 2 (b, a) for all a ∈ A 1 and b ∈ A 2 .If G is symmetric, we call a strategy π i for G N -exploitable in G for an N ∈ R ≥0 if there exists a π −i , such that We call π i N -inexploitable if it is not N -exploitable.
Analogously, in a symmetric game G we call a program p i N -exploitable for an N ∈ R ≥0 if there exists a p −i , such that We call p i N -inexploitable if it is not N -exploitable.
Corollary 8 Let G be a game and π i be an N -inexploitable strategy for G .Then Groundedπ i Bot is N -inexploitable.
Proof Follows directly from Theorem 5.
Notice that if -like tit for tat in the IPD -π i is N -inexploitable in G for all , then we can make Groundedπ i Bot arbitrarily close to 0-inexploitable by decreasing .

Conclusion
In this paper, we gave the following recipe for constructing a program equilibrium for a given two-player game: 1. Construct the game's corresponding repeated game.In particular, we consider repeated games in which each player can only react to the opponent's move in the previous round (rather than the entire history of previous moves by both players) and the game ends with some small probability after each round.2. Construct a Nash equilibrium for that iterated game.The result is a program equilibrium which we have argued is more robust than the equilibria described by Tennenholtz (2004).More generally, we have shown that translating an individual's strategy for the repeated game into a program for the stage game in the way described step 3 retains many of the properties of the strategy for the repeated Thus, it seems that "good" programs to submit may be derived from "good" strategies for the repeated game.
3. Convert each of the strategies into a computer program that works as follows (see Algorithm 3): with probability do what the strategy does in the first round.With probability 1 − , apply the opponent program to this program; then do what the underlying strategy would reply to the opponent program's output.