The Impatient May Use Limited Optimism to Minimize Regret

Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial; the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks whether the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE.


Introduction
A pervasive model used to study the strategies of an agent in an unknown environment is two-player infinite-horizon games played on finite weighted graphs. Therein, the set of vertices of a graph is split between two players, Adam and Eve, playing the roles of the environment and the agent, respectively. The play starts in a specific vertex, and each player decides where to go next when the play reaches one of their vertices. Questions asked about these games are usually of the form: Does there exist a strategy of Eve such that...? For such a question to be well-formed, one should provide the rules of the game (the graph and the division of its vertices) together with a valuation function that assigns a value to every play. The valuation function can be Boolean, in which case one says that Eve wins or loses (one very classical example has Eve winning if the maximum value appearing infinitely often along the edges is even). In this setting, it is often assumed that Adam is adversarial, and the question then becomes: Can Eve always win? (The names of the players stem from this view: is there a strategy of ∃ve that always beats ∀dam?) The literature on that subject spans more than 35 years, with newly found applications to this day (see [3] for comprehensive lecture notes, and [7] for an example of recent use in the analysis of attacks in cryptocurrencies).
The valuation function can also aggregate the numerical values along the edges into a reward value. We focus in this paper on discounted sum: if w is the weight of the edge taken at the n-th step, Eve's reward grows by λ^n · w, where λ ∈ (0, 1) is a prescribed discount factor. Discounting future rewards is a classical notion used in economics [18], Markov decision processes [16,9], systems theory [8], and is at the heart of Q-learning, a reinforcement learning technique widely used in machine learning [19]. In this setting, we consider three attitudes towards the environment: 1. The adversarial environment hypothesis translates to Adam trying to minimize Eve's reward, and the question becomes: Can Eve always achieve a reward of x? This problem is in NP ∩ coNP [20], and showing a P upper bound would constitute a major breakthrough (namely, it would imply the same for so-called parity games [15]). A strategy of Eve that maximizes her rewards against an adversarial environment is called worst-case optimal.
Conversely, a strategy that maximizes her rewards assuming a collaborative environment is called best-case optimal. 2. Assuming that the environment is adversarial is drastic, if not pessimistic.
Eve could rather be interested in settling for a strategy σ which is not consistently bad: if another strategy σ ′ gives a better reward in one environment, there should be another environment for which σ is better than σ ′ . Such strategies, called admissible [5], can be seen as an a priori rational choice. 3. Finally, Eve could put no assumption on the environment, but regret not having done so. Formally, the regret value of Eve's strategy is defined as the maximal difference, for all environments, between the best value Eve could have obtained and the value she actually obtained. Eve can thus be interested in following a strategy that achieves the minimal regret value, aptly called a regret-minimal strategy [10]. This constitutes an a posteriori rational choice [12]. Regret-minimal strategies were explored in several contexts, with applications including competitive online algorithm synthesis and robot-motion planning [2,11,13,14].
In this paper, we single out a class of strategies for Eve that first follow a best-case optimal strategy, then switch to a worst-case optimal strategy after some precise time; we call these strategies optipess. Our main contributions are then: 1. Optipess strategies are not only regret-minimal (a fact established in [13]) but also admissible; note that there are regret-minimal strategies that are not admissible, and vice versa (see Appendix). On the way, we show that for any strategy of Eve there is an admissible strategy that performs at least as well; this is a peculiarity of discounted-sum games. 2. The regret value of a given time-switching strategy can be computed with an NP algorithm (disregarding the cost of numerical analysis). The main technical hurdle is showing that exponentially long paths can be represented succinctly, a result of independent interest. 3. The question Can Eve's regret be bounded by x? is decidable in NP^coNP, improving on the implicit NExp algorithm of [13]. The algorithm consists in guessing a time-switching strategy and computing its regret value; since optipess strategies are time-switching strategies that are regret-minimal, the algorithm will eventually find the minimal regret value of the input game.
Notations and definitions are introduced in Section 2. The study of admissible regret-minimal strategies is done in Section 3. In Section 4, we provide an important lemma that allows us to represent long paths succinctly. In Section 5, we argue that the important values of a game (regret, best-case, worst-case) have short witnesses. Finally, in Section 6, we rely on these lemmas to present our new algorithms.

Preliminaries
We assume familiarity with basic graph and complexity theory. Some more specific definitions and known results are recalled here.
A (discounted-sum) game G is a tuple (V, v0, V∃, E, w, λ), where V is a finite set of vertices, v0 ∈ V is the starting vertex, V∃ ⊆ V is the subset of vertices that belong to Eve, E ⊆ V × V is a set of directed edges, w : E → Z is an (edge-)weight function, and 0 < λ < 1 is a rational discount factor. The vertices in V \ V∃ are said to belong to Adam. Since we consider games played for an infinite number of turns, we will always assume that every vertex has at least one outgoing edge.
A play is an infinite path v1 v2 · · · ∈ V^ω in the digraph (V, E). A history h = v1 · · · vn is a finite path. The length of h, written |h|, is the number of edges it contains: |h| = n − 1. The set Hist consists of all histories that start in v0 and end in a vertex from V∃.

Strategies. A strategy of Eve in G is a function σ that maps histories ending in a vertex v ∈ V∃ to a successor of v, that is, a vertex v′ with (v, v′) ∈ E. Strategies of Adam are defined similarly.
A history h = v 1 · · · v n is said to be consistent with a strategy σ of Eve if for all i ≥ 2 such that v i ∈ V ∃ , we have that σ(v 1 · · · v i−1 ) = v i . Consistency with strategies of Adam is defined similarly. We write Hist(σ) for the set of histories in Hist that are consistent with σ. A play is consistent with a strategy (of either player) if all its prefixes are consistent with it.
Given a vertex v and both Adam and Eve's strategies, τ and σ respectively, there is a unique play starting in v that is consistent with both, called the outcome of τ and σ on v. This play is denoted out v (σ, τ ).
For a strategy σ of Eve and a history h ∈ Hist(σ), we let σ_h be the strategy of Eve that assumes h has already been played. Formally, σ_h(h′) = σ(h · h′) for any history h′ (we will use this notation only on histories h′ that start with the ending vertex of h).

Values. The value of a history h = v1 · · · vn is the discounted sum of the weights on the edges: Val(h) = Σ_{i=1}^{n−1} λ^i · w(vi, vi+1). The value of a play is simply the limit of the values of its prefixes.
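Spelled out, the value of a history depends only on its weight sequence; the following sketch computes it with exact rationals (the vertex names and weights are illustrative, not from the paper):

```python
from fractions import Fraction

def val(history, weight, lam):
    """Discounted-sum value of a history v_1 ... v_n:
    Val(h) = sum_{i=1}^{n-1} lam^i * w(v_i, v_{i+1})."""
    return sum(
        lam ** i * weight[(history[i - 1], history[i])]
        for i in range(1, len(history))
    )

# Toy game fragment: a few weighted edges between two vertices.
w = {("u", "v"): 2, ("v", "u"): -1, ("u", "u"): 0}
lam = Fraction(1, 2)

# The history u v u takes edges (u,v) and (v,u): 2*lam - lam^2 = 3/4.
assert val(["u", "v", "u"], w, lam) == Fraction(3, 4)
```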
The antagonistic value of a strategy σ of Eve with history h = v1 · · · vn is the value Eve achieves when Adam tries to hinder her after h: aVal_h(σ) = inf_τ Val(out_h(σ, τ)), where τ ranges over all strategies of Adam and out_h(σ, τ) denotes the unique play extending h that is consistent with both σ and τ. The collaborative value cVal_h(σ) is defined in a similar way, by substituting "sup" for "inf." We write aVal_h (resp. cVal_h) for the best antagonistic (resp. collaborative) value achievable by Eve with any strategy.
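Both of these per-vertex values can be approximated by a standard value-iteration fixpoint, since the discount factor makes the update a contraction. This is a numeric sketch on a made-up two-vertex game, not the paper's algorithm:

```python
def value_iteration(edges, weight, eve, lam, collaborative=False, iters=200):
    """Approximate the fixpoint V(v) = opt_{(v,u) in E} lam * (w(v,u) + V(u)),
    where opt is max at Eve's vertices (everywhere, if collaborative) and min
    at Adam's. The update is a lam-contraction, so iteration converges."""
    verts = {v for e in edges for v in e}
    V = {v: 0.0 for v in verts}
    for _ in range(iters):
        V = {
            v: (max if (v in eve or collaborative) else min)(
                lam * (weight[(v, u)] + V[u]) for (x, u) in edges if x == v
            )
            for v in verts
        }
    return V

# Toy game: Eve owns "a", Adam owns "b"; weights are illustrative.
edges = [("a", "b"), ("b", "a"), ("b", "b")]
weight = {("a", "b"): 1, ("b", "a"): 0, ("b", "b"): -1}
aval = value_iteration(edges, weight, {"a"}, 0.5)                      # Adam loops on b
cval = value_iteration(edges, weight, {"a"}, 0.5, collaborative=True)  # Adam returns to a
assert abs(aval["a"]) < 1e-9 and abs(cval["a"] - 2 / 3) < 1e-9
```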
Types of strategies. A strategy σ of Eve is strongly worst-case optimal (SWO) if for every history h we have aVal_h(σ) = aVal_h; it is strongly best-case optimal (SBO) if for every history h we have cVal_h(σ) = cVal_h.
We single out a class of SWO strategies that perform well if Adam turns out to be helping. A SWO strategy σ of Eve is strongly best worst-case optimal (SBWO) if for every history h we have cVal_h(σ) = acVal_h, where acVal_h denotes the best collaborative value achievable by Eve with SWO strategies.

Lemma 1. In the context of discounted-sum games, strategies that are positional and strongly optimal always exist. Furthermore, the set of all such strategies can be characterized by local conditions.

Regret. The regret of a strategy σ of Eve is the maximal difference between the value obtained by using σ and the value obtained by using an alternative strategy: Reg(σ) = sup_τ (sup_{σ′} Val(out_{v0}(σ′, τ)) − Val(out_{v0}(σ, τ))), where τ and σ′ range over all strategies of Adam and Eve, respectively. The (minimal) regret of G is then Reg = inf_σ Reg(σ).

Regret can also be characterized by considering the point in history when Eve should have done things differently. Formally, for any vertices u and v let cVal_u^{¬v} be the maximal cVal_u(σ) for strategies σ verifying σ(u) ≠ v. Then:

Lemma 2 ([13, Lemma 13]). For all strategies σ of Eve: Reg(σ) = sup { λ^{|h|} (cVal_{vn}^{¬σ(h)} − aVal_{vn}(σ_h)) | h = v1 · · · vn ∈ Hist(σ) }.

Switching and optipess strategies. Given strategies σ1, σ2 of Eve and a threshold function t : V∃ → N ∪ {∞}, we define the switching strategy σ1 →t σ2 for any history h = v1 · · · vn ending in V∃ as follows: (σ1 →t σ2)(h) = σ2(h) if some prefix v1 · · · vi of h ends in a vertex vi ∈ V∃ with t(vi) ≤ i − 1, and (σ1 →t σ2)(h) = σ1(h) otherwise. We refer to histories for which the first condition holds as switched histories, and to all others as unswitched histories. The strategy is said to be bipositional if both σ1 and σ2 are positional. Note that in that case, if h is switched then σ_h = σ2, and otherwise σ_h is the same as σ but with t(v) changed to max{0, t(v) − |h|} for all v ∈ V∃. In particular, if |h| is greater than max{t(v) | t(v) < ∞}, then σ_h is nearly positional: it switches to σ2 as soon as it sees a vertex v with t(v) < ∞.

A strategy σ is perfectly optimistic-then-pessimistic (optipess, for short) if there are positional SBO and SBWO strategies σ_sbo and σ_sbwo such that σ = σ_sbo →t σ_sbwo, where t(v) = min{n ∈ N | λ^n (cVal_v − aVal_v) ≤ Reg} (with min ∅ = ∞).

In the game of Example 1, Eve's optipess strategy has a regret value of 2λ^2/(1 − λ). This is realized when Adam plays from v0 to v1, from v″1 to x, and from v′1 to y. Against that strategy, Eve ensures a discounted-sum value of 0 by playing according to σ, while regretting not having played to v″1 to obtain 2λ^2/(1 − λ).
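Under the reading of switching strategies given above, the switch test only inspects prefix lengths against the threshold function. A sketch in Python, with strategies modeled as functions from histories to vertices and all names illustrative:

```python
def make_switching_strategy(sigma1, sigma2, t, eve):
    """Time-switching strategy sigma1 -t-> sigma2: play sigma2 on switched
    histories, sigma1 on unswitched ones. A history is switched if some
    prefix ends in an Eve vertex v whose threshold t(v) is at most the
    prefix's length (counted in edges)."""
    def sigma(history):
        switched = any(
            v in eve and t.get(v, float("inf")) <= i
            for i, v in enumerate(history)  # prefix up to index i has i edges
        )
        return (sigma2 if switched else sigma1)(history)
    return sigma

# Illustrative positional strategies: always move to "x" / always to "y".
sigma = make_switching_strategy(lambda h: "x", lambda h: "y", {"v": 2}, {"v"})
assert sigma(["v"]) == "x"            # prefix length 0 < t(v) = 2: unswitched
assert sigma(["v", "u", "v"]) == "y"  # the prefix v u v has length 2 >= t(v)
```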

Admissible strategies and regret
There is no reason for Eve to choose a strategy that is consistently worse than another one. This classical notion is formalized as follows:

Definition 1. Let σ1, σ2 be two strategies of Eve. We say that σ1 is weakly dominated by σ2 if Val(out_{v0}(σ1, τ)) ≤ Val(out_{v0}(σ2, τ)) for every strategy τ of Adam. We say that σ1 is dominated by σ2 if σ1 is weakly dominated by σ2 but not conversely. A strategy σ of Eve is admissible if it is not dominated by any other strategy.

In the game of Example 2, the strategy σ of Eve corresponding to the double edges guarantees a discounted-sum value of 6λ^2/(1 − λ) against any strategy of Adam. Furthermore, it is worst-case optimal, since playing to v1 instead of v2 would give Adam the opportunity to ensure a strictly smaller value by playing to v″1. The latter also implies that σ is admissible. Interestingly, playing to v1 is also an admissible behavior of Eve since, against a strategy of Adam that plays from v1 to v′1, it obtains 10λ^2/(1 − λ) > 6λ^2/(1 − λ).
In this section, we show that (1) any strategy is weakly dominated by an admissible strategy; (2) being dominated entails more regret; (3) optipess strategies are both regret-minimal and admissible. We will need the following:

Lemma 3. A strategy σ of Eve is admissible if and only if for every history h ∈ Hist(σ) the following holds: either cVal_h(σ) > aVal_h, or aVal_h(σ) = cVal_h(σ) = aVal_h = acVal_h.
The above characterization of admissible strategies in so-called well-formed games was proved in [6,Theorem 11]. Lemma 3 follows from the fact that discounted-sum games are well-formed (see Appendix, Section E).

Any strategy is weakly dominated by an admissible strategy
We show that discounted-sum games have the distinctive property that every strategy is weakly dominated by an admissible strategy. This is in stark contrast with most cases where admissibility has been studied previously [6].
Theorem 2. Any strategy of Eve is weakly dominated by an admissible strategy.
Proof (Sketch). The main idea is to construct, based on σ, a strategy σ′ that switches to an SBWO strategy as soon as σ stops satisfying the characterization of Lemma 3. The first argument consists in showing that σ is indeed weakly dominated by σ′. This is easily done by comparing, against each strategy τ of Adam, the values of σ and σ′. The second argument consists in verifying that σ′ is indeed admissible. This is done by checking that each history h consistent with σ′ satisfies the characterization of Lemma 3, that is, cVal_h(σ′) > aVal_h or aVal_h(σ′) = cVal_h(σ′) = aVal_h = acVal_h. If σ′ is already following an SBWO strategy at h, then the definition of SBWO strategies ensures that aVal_h(σ′) = aVal_h and cVal_h(σ′) = acVal_h, and which part of the characterization is satisfied only depends on whether acVal_h > aVal_h. If σ′ is still following σ at h, the reasoning relies on the facts that σ′ weakly dominates σ and that σ satisfies the characterization of Lemma 3 up to h; this is true because σ′ and σ agree up to h. In the case where cVal_h(σ) > aVal_h, the weak dominance of σ by σ′ implies that cVal_h(σ′) ≥ cVal_h(σ), and thus that cVal_h(σ′) > aVal_h. In the case where aVal_h(σ) = cVal_h(σ) = aVal_h = acVal_h, the weak dominance of σ by σ′ implies both aVal_h(σ′) ≥ aVal_h and cVal_h(σ′) ≥ acVal_h. Since the first point shows that σ′ is worst-case optimal at h, we know, by definition of acVal_h, that cVal_h(σ′) ≤ acVal_h. Combined with the second point, we get that cVal_h(σ′) = acVal_h, and thus that aVal_h(σ′) = cVal_h(σ′) = aVal_h = acVal_h. ⊓ ⊔

Being dominated is regretful
Theorem 3. For all strategies σ, σ′ of Eve such that σ is weakly dominated by σ′, it holds that Reg(σ′) ≤ Reg(σ).

Proof. Let σ, σ′ be such that σ is weakly dominated by σ′. This means that for every strategy τ of Adam, we have that Val(π) ≤ Val(π′), where π = out_{v0}(σ, τ) and π′ = out_{v0}(σ′, τ). Consequently, we obtain sup_{σ″} Val(out_{v0}(σ″, τ)) − Val(π′) ≤ sup_{σ″} Val(out_{v0}(σ″, τ)) − Val(π). As this holds for any τ, we can conclude that Reg(σ′) ≤ Reg(σ). ⊓ ⊔

The converse of the theorem is however false.

Optipess strategies are both regret-minimal and admissible
Recall that there are admissible strategies that are not regret-minimal and vice versa (see Appendix, Section A). However, as a direct consequence of Theorem 2 and Theorem 3, there always exist regret-minimal admissible strategies. It turns out that optipess strategies, which are regret-minimal (Theorem 1), are also admissible: Theorem 4. All optipess strategies of Eve are admissible.
Proof. Let σ_sbo and σ_sbwo be positional SBO and SBWO strategies of Eve, let σ = σ_sbo →t σ_sbwo be an optipess strategy of Eve, and let h = v0 . . . vn ∈ Hist(σ) be a history consistent with σ. We deal here with the case where h is unswitched; if h is switched, then σ_h is SBWO and the characterization of Lemma 3 follows from the definition of SBWO strategies.

Since h is unswitched, σ and σ_sbo agree up to h and, in particular, σ(h) = σ_sbo(h). Furthermore, we know that cVal_h > aVal_h: otherwise at vn we would have λ^n (cVal_{vn} − aVal_{vn}) = 0, which is smaller than or equal to Reg, so h would be switched. Let us show that cVal_h(σ) > aVal_h, thus satisfying the first case of the characterization. Assume, towards a contradiction, that cVal_h(σ) ≤ aVal_h. Let τ be a strategy of Adam such that h is consistent with the outcome of τ and σ, and the value of the outcome of τ and σ_sbo is cVal_h. (Such a strategy and outcome indeed exist because h is consistent with σ_sbo and discounted-sum value functions are continuous; see Appendix E for more details.) By definition of the regret, we have that Reg(σ) ≥ cVal_h − cVal_h(σ) ≥ cVal_h − aVal_h. On the other hand, since the strategy σ is regret-minimizing, it holds that Reg(σ) = Reg, and since h is unswitched, Reg(σ) < λ^n (cVal_{vn} − aVal_{vn}). But we also have cVal_h − aVal_h = (Val(h) + λ^n cVal_{vn}) − (Val(h) + λ^n aVal_{vn}) = λ^n (cVal_{vn} − aVal_{vn}). We thus get a contradiction: Reg(σ) ≥ λ^n (cVal_{vn} − aVal_{vn}) > Reg(σ). ⊓ ⊔

Minimal values are witnessed by a single iterated cycle
We start our technical work towards a better algorithm to compute the regret value of a game. In this section, we show a crucial lemma on representing long histories: there are histories of a simple shape that witness small values in the game.
More specifically, we show that for any history h, there is another history h′ of the same length that has smaller value and such that h′ = α · β^k · γ, where |αβγ| is small. This will allow us to find the smallest possible value among exponentially long histories by guessing α, β, γ, and k, all of which are small. This property holds for a wealth of different valuation functions, hinting at possible further applications. Namely, the only requirement is the following:

Lemma 4. For any history h = α · β · γ with α and γ same-length cycles: min{Val(α · α · β), Val(β · γ · γ)} ≤ Val(α · β · γ).

Within the proof of the key lemma of this section, and later on when we use it (Lemma 9), we will rely on the following elementary notion of cycle decomposition:

Definition 2. A simple-cycle decomposition (SCD) is a pair consisting of paths and iterated simple cycles. Formally, an SCD is a pair D = ⟨(α_i)_{i=0}^n, (β_j, k_j)_{j=1}^n⟩, where each α_i is a path, each β_j is a simple cycle, and each k_j is a positive integer. We write D(j) = β_j^{k_j} · α_j and D(⋆) = α_0 · D(1) D(2) · · · D(n).

By carefully iterating Lemma 4, we obtain:

Lemma 5. For any history h there exists a history h′ = α · β^k · γ such that: h and h′ have the same starting and ending vertices, and the same length; Val(h′) ≤ Val(h); and |α · β · γ| is polynomially bounded in |V|.

Proof. In this proof, we focus on SCDs for which each path α_i is simple; we call them ßCDs. We define a well-founded partial order on ßCDs. Let D = ⟨(α_i)_{i=0}^n, (β_j, k_j)_{j=1}^n⟩ and D′ = ⟨(α′_i)_{i=0}^{n′}, (β′_j, k′_j)_{j=1}^{n′}⟩ be two ßCDs; we write D′ < D iff all of the following hold: D(⋆) and D′(⋆) have the same starting and ending vertices, the same length, and satisfy Val(D′(⋆)) ≤ Val(D(⋆)) and n′ ≤ n. That this order has no infinite descending chain is clear. We show two claims: 1. Any ßCD with n greater than |V| has a smaller ßCD; 2. Any ßCD with two exponents k_j, k_j′ > |V| has a smaller ßCD.
Together they imply that for a smallest ßCD D, D(⋆) is of the required form. Indeed, let j be the unique index for which k_j > |V|; the statement of the lemma is then satisfied by taking β = β_j and letting α and γ be the prefix and suffix of D(⋆) around β_j^{k_j}.

Claim 1. Suppose D has n > |V|. Since all cycles are simple, there are two cycles β_j, β_j′, j < j′, of the same length. We can apply Lemma 4 to the path β_j · (α_j D(j + 1) · · · D(j′ − 1)) · β_j′, and remove one of the two cycles while duplicating the other; we thus obtain a similar path of smaller value. This can be done repeatedly until we obtain a path with only one of the two cycles, say β_j′, the other case being similar. Substituting this path in D(⋆) results in a history of the same length, endpoints, and smaller value. This gives rise to a smaller ßCD as follows. If α_{j−1} α_j is still a simple path, then the above history is expressible as an ßCD with a smaller number of cycles.
Claim 2. Suppose now that D has two exponents k_j, k_j′ > |V|. Since each cycle in the ßCD is simple, k_j and k_j′ are greater than both |β_j| and |β_j′|; let us write k_j = b|β_j′| + r with 0 ≤ r < |β_j′| and, similarly, k_j′ = b′|β_j| + r′. Noting that β_j′^{|β_j|} and β_j^{|β_j′|} are cycles of the same length, we can transfer all the occurrences of one to the other, as in Claim 1. Similarly, if two simple paths get merged and give rise to a cycle, a smaller ßCD can be constructed; if not, then there are now at most r < |V| occurrences of β_j′ (or, conversely, r′ of β_j), again resulting in a smaller ßCD. ⊓ ⊔
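As a sanity check of the statement of Lemma 4 as read above (α and γ same-length cycles; one of the two cycle-transfer operations cannot increase the value), here is a small numeric experiment in which path segments are represented by their weight sequences, with arbitrary illustrative weights:

```python
from fractions import Fraction

def val(weights, lam):
    # Discounted sum with the i-th edge (1-indexed) discounted by lam^i.
    return sum(lam ** i * w for i, w in enumerate(weights, start=1))

lam = Fraction(1, 2)
alpha = [3, -1]      # a cycle, given as its sequence of edge weights
gamma = [0, 2]       # another cycle of the same length
beta = [1, 1, 1]     # the middle segment

h = alpha + beta + gamma
# Drop one cycle and duplicate the other; lengths and endpoints are preserved.
pumped = [alpha + alpha + beta, beta + gamma + gamma]
assert min(val(p, lam) for p in pumped) <= val(h, lam)
```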

Short witnesses for regret, antagonistic, and collaborative values
We continue our technical work towards our algorithm for computing the regret value. In this section, the overarching theme is that of short witnesses. We show that (1) The regret value of a strategy is witnessed by histories of bounded length; (2) The collaborative value of a game is witnessed by a simple path and an iterated cycle; (3) The antagonistic value of a strategy is witnessed by an SCD and an iterated cycle.

Regret is witnessed by histories of bounded length
Lemma 6. For any bipositional switching strategy σ of Eve, the supremum in the characterization of Reg(σ) given by Lemma 2 is reached by a history of length at most C = 2|V| + max{t(v) | t(v) < ∞}.

Proof. Consider a history h of length greater than C, and write h = h1 · p · p′, where |h1| = max{t(v) | t(v) < ∞} and p and p′ are the unswitched and switched parts, respectively, of the remainder. Note that one of p or p′ is longer than |V| (say p, the other case being similar). This implies that there is a cycle in p, i.e., p = α · β · γ with β a cycle. Let h′ = h1 · α · γ · p′; this history has the same starting and ending vertex as h. Moreover, since |h1| is larger than any finite value of the threshold function, σ_h = σ_{h′}. Lastly, h′ is still in Hist(σ), since the removed cycle did not play a role in switching strategy. This shows that h′ gives rise to a local regret at least as large as that of h: since |h′| < |h|, the discount factor λ^{|h′|} is greater than λ^{|h|}, resulting in a bigger regret value. There is thus no need to consider histories of length greater than C.
⊓ ⊔

It may seem, from this lemma and the fact that t(v) may be very large, that we will need to guess histories of considerable length. However, since we will be considering bipositional switching strategies, we will only be interested in a few properties of the histories, which are not hard to verify:

Lemma 7. The following problem is decidable in NP: Given: A game, a bipositional switching strategy σ, a number n in binary, a Boolean b, and two vertices v, v′. Question: Is there a history h ∈ Hist(σ) of length n that ends in v, satisfies σ(h) = v′, and is switched if and only if b?

Proof. This is done by guessing multiple flows within the graph (V, E). Here, we call flow a valuation of the edges E by integers that describes the number of times a path crosses each edge. Given a vector in N^E, it is not hard to check whether there is a path that it represents, and to extract the initial and final vertices of that path [17]. We first order the different thresholds of the strategy σ = σ1 →t σ2: let t1 < t2 < · · · < tk be the distinct finite values taken by t. We analyze the structure of histories consistent with σ. Let h ∈ Hist(σ), and write h = h′ · h″, where h′ is the maximal unswitched prefix of h. Naturally, h′ is consistent with σ1 and h″ is consistent with σ2. Then h′ = h0 h1 · · · hi, for some i < |V∃|, where each h_j is played between two consecutive thresholds. To check the existence of a history with the given parameters, it is thus sufficient to guess the value i ≤ |V∃| and to guess i connected flows (rather than paths) with the above properties that are consistent with σ1. Finally, we guess a flow for h″ consistent with σ2 if we need a switched history, and verify that it starts at a switching vertex. The flows must sum to n + 1, with the last vertex being v′ and the previous one v.
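The flow-checking step of this proof uses the classical Eulerian-path characterization: an integer edge valuation is realizable as a path if and only if degrees balance at all vertices except the endpoints and the support is connected. A sketch of that check (the dictionary encoding of flows is our own):

```python
from collections import defaultdict

def flow_is_path(flow, src, dst):
    """Check that an edge valuation (edge -> multiplicity) is realizable as
    a path from src to dst: Eulerian degree balance plus connectivity of
    the support."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for (u, v), k in flow.items():
        if k < 0:
            return False
        out_deg[u] += k
        in_deg[v] += k
    for v in set(out_deg) | set(in_deg) | {src, dst}:
        expected = (1 if v == src else 0) - (1 if v == dst else 0)
        if out_deg[v] - in_deg[v] != expected:
            return False
    # The support must be connected; together with degree balance,
    # undirected reachability from src is enough.
    adj = defaultdict(set)
    for (u, v), k in flow.items():
        if k > 0:
            adj[u].add(v)
            adj[v].add(u)
    seen, stack = {src}, [src]
    while stack:
        for u in adj[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return all(u in seen for (u, v), k in flow.items() if k > 0)

assert flow_is_path({("a", "b"): 2, ("b", "a"): 1}, "a", "b")      # a b a b
assert not flow_is_path({("a", "b"): 1, ("c", "d"): 1}, "a", "d")  # disconnected
```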

Short witnesses for the collaborative and antagonistic values
Lemma 8. There is a set P of pairs (α, β), with α a simple path and β a simple cycle, such that cVal_{v0} = max{Val(α · β^ω) | (α, β) ∈ P}. Additionally, membership in P is decidable in polynomial time w.r.t. the game.
Proof. This is a consequence of Lemma 1: Consider positional SBO strategies τ and σ of Adam and Eve, respectively. Since they are positional, the play out_{v0}(σ, τ) is of the form α · β^ω, as required, and its value is cVal_{v0}. Moreover, it can be easily checked, given a pair (α, β), whether there exists a pair of strategies with outcome α · β^ω. If that holds, the value Val(α · β^ω) will be at most cVal_{v0}.
⊓ ⊔

Lemma 9. Let σ be a bipositional switching strategy of Eve. There is a set K of pairs (D, β), with D an SCD and β a simple cycle, such that aVal_{v0}(σ) = min{Val(D(⋆) · β^ω) | (D, β) ∈ K}. Additionally, the size of each pair is polynomially bounded, and membership in K is decidable in polynomial time w.r.t. σ and the game.
Proof. Let C = max{t(v) | t(v) < ∞}, and consider a play π consistent with σ that achieves the value aVal_{v0}(σ). Write π = h · π′ with |h| = C, and let v be the final vertex of h. Naturally, Val(π) = Val(h) + λ^C · Val(π′). We first show how to replace π′ by some α · β^ω, with α a simple path and β a simple cycle. First, since π witnesses aVal_{v0}(σ), we have that Val(π′) = aVal_v(σ_h). Now σ_h is positional, because |h| ≥ C. It is known that there are optimal positional antagonistic strategies τ for Adam, that is, strategies that satisfy aVal_v(σ_h) = Val(out_v(σ_h, τ)). As in the proof of Lemma 8, this implies that aVal_v(σ_h) = Val(α · β^ω) = Val(π′) for some α and β; additionally, any (α, β) consistent with σ_h and a potential strategy of Adam will give rise to a bigger value.
We now argue that Val(h) is witnessed by an SCD of polynomial size. This bears similarity to the proof of Lemma 7. Specifically, we will reuse the fact that histories consistent with σ can be split into histories played "between thresholds." Let us write σ = σ1 →t σ2 and split h accordingly. We now diverge from the proof of Lemma 7. We apply Lemma 5 on each h_j in the game where the strategy σ1 is hardcoded (that is, we first remove every edge (u, v) ∈ V∃ × V that does not satisfy σ1(u) = v). We obtain a history h′_0 h′_1 · · · h′_i that is still in Hist(σ), thanks to the previous splitting of h. We also apply Lemma 5 to the switched part of h, this time in the game where σ2 is hardcoded. Since each of the resulting segments is expressed as α · β^k · γ, there is an SCD D with no more than |V∃| elements that satisfies Val(D(⋆)) ≤ Val(h); naturally, since Val(h) is minimal and D(⋆) ∈ Hist(σ), this means that the two values are equal. Note that it is not hard, given an SCD D, to check whether D(⋆) ∈ Hist(σ), and that SCDs that do not achieve the value Val(h) give rise to a bigger value.
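The values Val(α · β^ω) and Val(D(⋆) · β^ω) appearing in these lemmas are easy to evaluate exactly: the periodic tail closes into a geometric series. A sketch with path segments given by their weight sequences:

```python
from fractions import Fraction

def val(weights, lam):
    # Discounted sum of a finite segment, i-th edge discounted by lam^i.
    return sum(lam ** i * w for i, w in enumerate(weights, start=1))

def lasso_val(alpha, beta, lam):
    """Exact value of the ultimately periodic play alpha . beta^omega:
    Val(alpha) + lam^|alpha| * Val(beta) / (1 - lam^|beta|)."""
    tail = lam ** len(alpha) * val(beta, lam) / (1 - lam ** len(beta))
    return val(alpha, lam) + tail

lam = Fraction(1, 2)
# A self-loop of weight 1: sum_{i >= 1} lam^i = lam / (1 - lam) = 1.
assert lasso_val([], [1], lam) == 1
# One weight-2 edge into the same loop: 2*lam + lam * 1 = 3/2.
assert lasso_val([2], [1], lam) == Fraction(3, 2)
```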

The complexity of regret
We are finally equipped to present our algorithms. To account for the cost of numerical analysis, we rely on the problem PosSLP [1]. This problem consists in determining whether an arithmetic circuit with addition, subtraction, and multiplication gates, together with input values, evaluates to a positive integer.
PosSLP is known to be decidable in the so-called counting hierarchy, itself contained in the set of problems decidable using polynomial space.
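To make the oracle concrete, here is a direct evaluator for such circuits (the list-of-gates encoding is our own choice). Python's big integers make the evaluation exact, but n multiplication gates can square repeatedly and yield numbers with about 2^n bits, which is why this brute-force evaluation is not a polynomial-time procedure:

```python
def eval_circuit(gates, inputs):
    """Evaluate an arithmetic circuit given as a topologically ordered list
    of gates (name, op, left, right) with op in {+, -, *}; leaf values come
    from `inputs`. The last gate is the output."""
    vals = dict(inputs)
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    for name, op, left, right in gates:
        vals[name] = ops[op](vals[left], vals[right])
    return vals[gates[-1][0]]

def pos_slp(gates, inputs):
    return eval_circuit(gates, inputs) > 0

# x * x - y with x = 3, y = 10 evaluates to -1, hence is not positive.
gates = [("sq", "*", "x", "x"), ("out", "-", "sq", "y")]
assert pos_slp(gates, {"x": 3, "y": 10}) is False
```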
Theorem 5. The following problem is decidable in NP^PosSLP: Given: A game, a bipositional switching strategy σ, a value r ∈ Q in binary. Question: Is Reg(σ) > r?
Proof. Let us write σ = σ1 →t σ2. Lemma 6 indicates that Reg(σ) > r holds if there is a history h of some length n ≤ C = 2|V| + max{t(v) | t(v) < ∞}, ending in some vn, such that λ^n (cVal_{vn}^{¬σ(h)} − aVal_{vn}(σ_h)) > r. (1) Note that since σ is bipositional, we do not need to know everything about h. Indeed, the following suffice: its length n, its final vertex vn, the vertex v′ = σ(h), and whether it is switched. Rather than guessing h, we can thus rely on Lemma 7 to get the required information. We start by simulating the NP machine that this lemma provides, and verify that n, vn, and v′ are consistent with a potential history.
Let us now concentrate on the collaborative value that we need to evaluate in Equation 1. To compute it, we rely on Lemma 8, which we apply in the game where vn is made initial and its successor is forced not to be v′ = σ(h). We guess a pair (α_c, β_c) ∈ P; we thus have Val(α_c · β_c^ω) ≤ cVal_{vn}^{¬σ(h)}, with at least one guessed pair (α_c, β_c) reaching that latter value.

Let us now focus on computing aVal_{vn}(σ_h). Since σ is a bipositional switching strategy, σ_h is simply σ where t(v) is changed to max{0, t(v) − n}. Lemma 9 can thus be used to compute our value. To do so, we guess a pair (D, β_a) ∈ K; we thus have Val(D(⋆) · β_a^ω) ≥ aVal_{vn}(σ_h), and at least one pair (D, β_a) reaches that latter value.
Our guesses satisfy λ^n (Val(α_c · β_c^ω) − Val(D(⋆) · β_a^ω)) ≤ λ^n (cVal_{vn}^{¬σ(h)} − aVal_{vn}(σ_h)), and there is a choice of guessed paths and SCD for which the two sides are equal. Comparing the left-hand side with r can be done using an oracle for PosSLP (see Appendix, Section G), concluding the proof. ⊓ ⊔

Theorem 6. The following problem is decidable in coNP^{NP^PosSLP}: Given: A game, a value r ∈ Q in binary. Question: Is Reg > r?
Proof. To decide the problem at hand, we ought to check that every strategy has a regret value greater than r. However, optipess strategies being regret-minimal, we need only check this for a class of strategies that contains optipess strategies: bipositional switching strategies form one such class.
What is left to show is that optipess strategies can be encoded in polynomial space. Naturally, the two positional strategies contained in an optipess strategy can be encoded succinctly. We thus only need to show that, with t as in the definition of optipess strategies, t(v) is at most exponential for every v ∈ V∃ with t(v) ∈ N. This is shown in Appendix, Section H.
⊓ ⊔

Theorem 7. The following problem is decidable in coNP^{NP^PosSLP}: Given: A game, a bipositional switching strategy σ. Question: Is σ regret optimal?

Proof. A consequence of the proof of Theorem 5 and the existence of optipess strategies is that the value Reg of a game can be computed by a polynomial-size arithmetic circuit. Moreover, our reliance on PosSLP allows the input r of Theorem 5 to be represented as an arithmetic circuit without impacting the complexity. We can thus verify that for all bipositional switching strategies σ′ (with sufficiently large threshold functions) and all possible polynomial-size arithmetic circuits r, Reg(σ) > r implies that Reg(σ′) > r. The latter holds if and only if σ is regret optimal since, as we argued in the proof of Theorem 6, such strategies σ′ include optipess strategies and thus regret-minimal strategies. ⊓ ⊔

Conclusion
We studied regret, a notion of interest for an agent that does not want to assume that the environment she plays in is simply adversarial. We showed that there are strategies that both minimize regret, and are not consistently worse than any other strategies. The problem of computing the minimum regret value of a game was then explored, and a better algorithm was provided for it.
The exact complexity of this problem remains, however, open. The only known lower bound, a straightforward adaptation of [14, Lemma 3] to discounted-sum games, shows that it is at least as hard as solving parity games [15]. Our upper bound could be significantly improved if we could efficiently solve the following problem: Given: λ ∈ Q, (a_i)_{i=1}^n ∈ Z^n, (b_i)_{i=1}^n ∈ N^n, and r ∈ Q, all in binary. Question: Is Σ_{i=1}^n a_i λ^{b_i} > r? The exact complexity of that problem seems to be open even for n = 3.
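With the problem read as deciding whether Σ_i a_i λ^{b_i} > r, the difficulty is size rather than correctness: exact rational arithmetic decides it, but with the b_i given in binary the powers λ^{b_i} have exponentially many digits. A sketch:

```python
from fractions import Fraction

def compare_discounted_sum(lam, terms, r):
    """Decide sum_i a_i * lam^b_i > r exactly, where terms is a list of
    (a_i, b_i) pairs. Correct, but not polynomial-time: the numerators and
    denominators of lam^b_i grow linearly in b_i, i.e., exponentially in
    the binary encoding of b_i."""
    return sum(a * lam ** b for a, b in terms) > r

lam = Fraction(1, 2)
assert compare_discounted_sum(lam, [(3, 1), (1, 3)], Fraction(3, 2))  # 13/8 > 3/2
assert not compare_discounted_sum(lam, [(1, 2)], Fraction(1, 2))      # 1/4 <= 1/2
```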
A Regret-minimal versus admissible strategies

Consider the discounted-sum game depicted in Example 1. Let σ be the strategy of Eve corresponding to the double edges. This strategy is not admissible: it is dominated by the alternative strategy σ′ of Eve that behaves like σ from v1 but chooses to go to v′2 from v2. Indeed, if τ is a strategy of Adam that goes to v1, then the outcome plays of σ and σ′ are the same, and thus have the same value. Now, if τ is a strategy of Adam that goes to v2, then the value of the outcome play of σ and τ is 0, while the value of the outcome play of σ′ and τ is Σ_{i=2}^∞ λ^i, which is strictly greater than 0. However, the strategy σ is regret-minimizing. If τ is the strategy of Adam that goes to v2, then the maximal difference of values between plays following σ and plays following alternative strategies is actually attained with σ′, and is thus Σ_{i=2}^∞ λ^i. Now, if Adam goes to v1, the maximal difference of values between plays following σ and plays following alternative strategies is Σ_{i=2}^∞ 2λ^i: if the strategy of Adam is such that it chooses to go to y from v′1 and to x from v″1, playing σ yields a play value of 0, while going to v″1 yields a play value of Σ_{i=2}^∞ 2λ^i, which is strictly greater than Σ_{i=2}^∞ λ^i, since λ > 0. Thus, we have that Reg(σ) = Σ_{i=2}^∞ 2λ^i. Symmetrically, any strategy that chooses to go to v″1 from v1 also has a regret of Σ_{i=2}^∞ 2λ^i. Thus, the strategy σ is regret-minimizing.
Consider now the discounted-sum game depicted in Example 2. Let σ be the strategy of Eve corresponding to the double edges. This strategy is admissible. In this game, Eve has only two available strategies: σ and the strategy σ′ that goes to v1 from v0. It is easy to see that σ is not weakly dominated by σ′. Indeed, let us fix the strategy τ of Adam that goes to v″1 from v1. Against σ, it yields a play value of Σ_{i=2}^∞ 6λ^i, while against σ′, it yields a strictly smaller play value of Σ_{i=2}^∞ 5λ^i. Hence, σ is not dominated by σ′, and is thus admissible. The strategy σ is however not regret-minimizing; in fact, the strategy σ′ has a smaller regret. Indeed, the regret of σ is the difference between the best possible outcome value of σ′, which is Σ_{i=2}^∞ 10λ^i, and its own only outcome value Σ_{i=2}^∞ 6λ^i; that is, a regret of Σ_{i=2}^∞ 4λ^i. On the other hand, the regret of σ′ is the difference between the best (and only) possible outcome value of σ, which is Σ_{i=2}^∞ 6λ^i, and its own worst possible outcome value Σ_{i=2}^∞ 5λ^i; that is, a regret of Σ_{i=2}^∞ λ^i, which is strictly less than Σ_{i=2}^∞ 4λ^i. Hence, the strategy σ is not regret-minimizing. Notice that the strategy σ′ is in fact an optipess strategy (even though rather trivially).

B A proof of the existence of SBWO strategies
We prove the existence of positional SBWO strategies and their characterization stated in Lemma 1.
Proof. The characterization of SBWO strategies actually follows directly from the characterization of SWO and SBO strategies also given by Lemma 1, and from the definition of acVal. Below, we focus on the positionality claim.
From [20, Theorem 5.1] we know that for all $u \in V_\exists$ it holds that $\mathrm{aVal}_u = \max_{(u,u') \in E} \big( w(u,u') + \lambda \cdot \mathrm{aVal}_{u'} \big)$. Denote by $A$ the game obtained by restricting $G$ to the subset of edges that realize this maximum.

C Proof of Theorem 1
It is known that minimal-regret strategies always exist.
Lemma 10 (Follows from [13,Proposition 18]). For all games and all initial vertices v 0 , there exists a strategy σ of Eve such that Reg (σ) = Reg.
The following upper bound on the "local regret" of strategies that are SWO will be useful.
Lemma 11. For all $v_0 \in V_\exists$ and for all SWO strategies $\sigma$ from $v_0$, the following bound on the local regret holds.

Proof. We first observe that for all strategies $\sigma'$ of Eve and all histories the corresponding inequality between the values holds. Hence, for SWO strategies, the latter equality always holds.
The following inequalities yield the result.
by the argument above. ⊓⊔

Using the above lemma, it is straightforward to argue that Eve can switch to following an SWO strategy without increasing her regret.

Lemma 12. Let $\sigma_{\mathrm{swo}}$ be an SWO strategy of Eve. For all strategies $\sigma$ of Eve and all $v_0 \in V$, if we let $\sigma'$ be the strategy that follows $\sigma$ until Equation (2) holds for the current history and follows $\sigma_{\mathrm{swo}}$ from then on, then $\mathrm{Reg}(\sigma') \le \mathrm{Reg}(\sigma)$.

Proof. Observe that a history consistent with $\sigma'$ is a switched history if and only if it has a prefix $v_0 \dots v_n \in \mathrm{Hist}(\sigma')$ for which Equation (2) holds. Let $S_{\sigma'}$ denote the maximal local regret incurred by switched histories consistent with $\sigma'$, and $U$ the maximal local regret incurred by all unswitched histories (therefore consistent with both $\sigma$ and $\sigma'$). More formally, $S_{\sigma'}$ is defined with the supremum ranging over all switched histories $h = v_0 \dots v_n \in \mathrm{Hist}(\sigma')$, and $U$ with the supremum ranging over all unswitched histories $h' = v'_0 \dots v'_m \in \mathrm{Hist}(\sigma) \cap \mathrm{Hist}(\sigma')$. From Lemma 2 and the definition of $\sigma'$ it follows that $\mathrm{Reg}(\sigma') = \max(S_{\sigma'}, U)$. Now, consider the value $S_0$ defined with the supremum ranging over all switched histories $h' = v_0 \dots v_n \in \mathrm{Hist}(\sigma')$ such that no proper prefix of $h'$ is a switched history. (This indeed implies that $h'$ is consistent with $\sigma$ too.) From Lemma 11 we have that $S_{\sigma'} \le S_0$, and therefore $\mathrm{Reg}(\sigma') \le \max(S_0, U)$. Observe that, using Equation (2), we obtain that $S_0 \le \mathrm{Reg}(\sigma)$. To conclude the proof it thus suffices to show that $U \le \mathrm{Reg}(\sigma)$. This follows from Lemma 2, which gives $\mathrm{Reg}(\sigma) = \max(S_\sigma, U)$, where $S_\sigma$ denotes the maximal local regret incurred by histories consistent with $\sigma$ that have a prefix that is a switched history consistent with $\sigma'$. Hence, the claim holds.

⊓ ⊔
The above result provides us with a way of simplifying regret-minimizing strategies: for any $v_0 \in V$ and any strategy $\sigma$ of Eve, there is a second strategy $\sigma'$ of hers that follows $\sigma$ as long as Equation (2) does not hold for the current history. Otherwise, $\sigma'$ switches, once and for all, to a worst-case optimal strategy.
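The simplification described above is an instance of a generic "follow $\sigma$, then switch for good" combinator. The sketch below shows its shape; the strategy names and the length-based switching test are illustrative stand-ins (the real test is Equation (2)), not the paper's construction.

```python
def switching_strategy(sigma, sigma_swo, should_switch):
    """Return a strategy that follows `sigma` until `should_switch(history)`
    first holds at some prefix, and follows `sigma_swo` from then on, forever."""
    def sigma_prime(history):
        # Scan prefixes: once the switch condition held at any prefix,
        # the switch is definitive.
        for i in range(1, len(history) + 1):
            if should_switch(history[:i]):
                return sigma_swo(history)
        return sigma(history)
    return sigma_prime

# Toy illustration with hypothetical strategies over vertex names:
risky = lambda h: "left"                   # stand-in for the original strategy
safe = lambda h: "right"                   # stand-in for a worst-case optimal strategy
switch_after_two = lambda h: len(h) >= 2   # stand-in for the Equation (2) test

s = switching_strategy(risky, safe, switch_after_two)
assert s(("v0",)) == "left"
assert s(("v0", "v1")) == "right"
assert s(("v0", "v1", "v2")) == "right"
```

Note that the scan over prefixes makes the switch irrevocable, matching the "once and for all" behavior used in the proof.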
The following definition will be useful. We denote by $\mathrm{cOpt}(u)$ the set of all best-case-optimal successors of $u \in V_\exists$, i.e.

$\mathrm{cOpt}(u) = \{ v \in V \mid (u, v) \in E \text{ and } \mathrm{cVal}_u = w(u, v) + \lambda \cdot \mathrm{cVal}_v \}.$
Proof (of Theorem 1). Lemma 10 tells us that for all $v_0 \in V$ there exists a strategy $\sigma_0$ of Eve such that $\mathrm{Reg}(\sigma_0) = \mathrm{Reg}$. Let $\sigma_{\mathrm{sbwo}}$ be an SBWO strategy of Eve. From Lemma 12 we get that the strategy $\sigma$ obtained by combining $\sigma_0$ with $\sigma_{\mathrm{sbwo}}$ is also such that $\mathrm{Reg}(\sigma) = \mathrm{Reg}$. We will now argue that $\sigma$ is an optipess strategy. In fact, we will prove something slightly stronger: for all SBO strategies $\sigma_{\mathrm{sbo}}$ of Eve and for all $h = v_0 \dots v_n \in \mathrm{Hist}(\sigma)$ such that $h$ is an unswitched history, we have that $|\mathrm{cOpt}(v_n)| = 1$ and $\sigma(h) = \sigma_{\mathrm{sbo}}(h)$. Towards a contradiction, assume that this is not the case. That is, there exists such an $h$ for which $|\mathrm{cOpt}(v_n)| > 1$, or $|\mathrm{cOpt}(v_n)| = 1$ but $\sigma(h) \ne \sigma_{\mathrm{sbo}}(h)$ for all SBO strategies $\sigma_{\mathrm{sbo}}$ of Eve. In the latter case, by Lemma 1, we must have that $\sigma(h) \notin \mathrm{cOpt}(v_n)$. It should be clear that in either case we have that $\mathrm{cVal}_{v_n}^{\neg \sigma(h)} = \mathrm{cVal}_{v_n}$. We thus have, by definition of $\mathrm{aVal}$, that the local regret of $\sigma$ at $h$ is at least $\lambda^n (\mathrm{cVal}_{v_n} - \mathrm{aVal}_{v_n})$.
By assumption, we have that $h$ is an unswitched history and therefore $\lambda^n (\mathrm{cVal}_{v_n} - \mathrm{aVal}_{v_n}) > \mathrm{Reg}$.
The above inequalities thus imply that the regret of σ is strictly larger than Reg, which is a contradiction. ⊓ ⊔

D Proof of Theorem 2
Proof (of Theorem 2). Let $\sigma$ be a strategy of Eve and let $D$ be the set of histories $h$ such that the sequence of inequalities $\mathrm{aVal}_h(\sigma) \le \mathrm{cVal}_h(\sigma) \le \mathrm{aVal}_h \le \mathrm{acVal}_h$ holds with at least one inequality being strict. Denote by $\mathrm{sp}(D)$ the (possibly infinite) subset of $D$ that contains all the shortest prefixes of the histories in $D$, that is, the histories $h \in D$ such that no proper prefix of $h$ belongs to $D$. We now define a strategy $\sigma'$ for all histories $h = v_0 \dots v_n$ such that $v_n \in V_\exists$ as follows: $\sigma'(h) = \sigma_{\mathrm{sbwo}}^{h'}(h)$ if some $h' \in \mathrm{sp}(D)$ is a prefix of $h$, and $\sigma'(h) = \sigma(h)$ otherwise.
(Note that $\sigma'$ is well-defined, as all the elements of $\mathrm{sp}(D)$ are incomparable.) Intuitively, the strategy $\sigma'$ follows $\sigma$ until the above sequence of inequalities holds, which, essentially, means that one can do better than $\sigma$ from that point onward. Then, $\sigma'$ switches to following an SBWO strategy forever.
We first show that $\sigma$ is weakly dominated by $\sigma'$. To do so, we compare $\sigma$ and $\sigma'$ against the strategies of Adam. Let $\tau$ be a strategy of Adam, let $\pi$ be the outcome play of $\sigma$ and $\tau$, and $\pi'$ the one of $\sigma'$ and $\tau$. If $\pi = \pi'$, then clearly $\mathrm{Val}(\pi) = \mathrm{Val}(\pi')$. Otherwise, if $\pi \ne \pi'$, they share a longest common prefix $h = v_0 \dots v_n$. As $\tau$ is fixed, we know that $v_n \in V_\exists$ and that $\sigma(h) \ne \sigma'(h)$. By definition of $\sigma'$, this means that there exists a prefix $h'$ of $h$ such that $h' \in \mathrm{sp}(D)$. Thus, we have that $\mathrm{cVal}_{h'}(\sigma) \le \mathrm{aVal}_{h'}$ and consequently $\mathrm{Val}(\pi) \le \mathrm{aVal}_{h'}$. On the other hand, from $h'$ on, $\sigma'$ behaves like $\sigma_{\mathrm{sbwo}}^{h'}$; thus $\mathrm{Val}(\pi') \ge \mathrm{aVal}_{h'}$. Hence, $\mathrm{Val}(\pi) \le \mathrm{Val}(\pi')$. This is true for any strategy of Adam, so $\sigma$ is indeed weakly dominated by $\sigma'$.
We now show that $\sigma'$ is admissible. Towards this, we use the characterization from Lemma 3: a strategy $\sigma$ of Eve is admissible if and only if for every history $h \in \mathrm{Hist}(\sigma)$ either $\mathrm{cVal}_h(\sigma) > \mathrm{aVal}_h$, or $\mathrm{aVal}_h(\sigma) = \mathrm{cVal}_h(\sigma) = \mathrm{aVal}_h = \mathrm{acVal}_h$. Let $h \in \mathrm{Hist}(\sigma')$.
- Assume first that there exists a prefix $h'$ of $h$ that belongs to $\mathrm{sp}(D)$. In that case, we know that $\sigma'$ behaves like $\sigma_{\mathrm{sbwo}}^{h'}$ from $h'$ on, and the characterization holds by Lemma 1.
- Assume now that $h$ has no prefix that belongs to $\mathrm{sp}(D)$. By definition of $\sigma'$, this means that $\sigma'(h') = \sigma(h')$ for all prefixes $h' \subseteq_{\mathrm{pref}} h$. In other terms, $\sigma$ and $\sigma'$ agree (at least) up to $h$. Let $h\pi$ be an outcome consistent with $\sigma$ such that $\mathrm{Val}(h\pi) = \mathrm{cVal}_h(\sigma)$ (which exists because discounted-sum games are well-formed). Let $\tau$ be a strategy of Adam such that $\pi^{v_0}_{\sigma\tau} = h\pi$ (which exists because $h\pi$ is consistent with $\sigma$). Since $\sigma$ and $\sigma'$ agree up to $h$, there exists $\pi'$ such that $h\pi' = \pi^{v_0}_{\sigma'\tau}$. Recall that $\sigma$ is weakly dominated by $\sigma'$. As $\tau$ is fixed, we know that $\mathrm{Val}(h\pi) \le \mathrm{Val}(h\pi')$. Thus, we have that $\mathrm{cVal}_h(\sigma) \le \mathrm{cVal}_h(\sigma')$. Recall now that, by definition of $\mathrm{sp}(D)$, the history $h$ does not belong to $D$; in particular, it either holds that (A) $\mathrm{cVal}_h(\sigma) > \mathrm{aVal}_h$, or (B) $\mathrm{aVal}_h(\sigma) = \mathrm{cVal}_h(\sigma) = \mathrm{aVal}_h = \mathrm{acVal}_h$.

Suppose (A) holds. We thus have that $\mathrm{cVal}_h(\sigma') \ge \mathrm{cVal}_h(\sigma) > \mathrm{aVal}_h$.
This means that $\sigma'$ satisfies the first part of the characterization. Finally, suppose that (B) holds. We then have $\mathrm{cVal}_h(\sigma') \ge \mathrm{cVal}_h(\sigma) = \mathrm{acVal}_h$, and thus $\mathrm{cVal}_h(\sigma') \ge \mathrm{acVal}_h$. Furthermore, we also know that $\mathrm{aVal}_h(\sigma') \ge \mathrm{aVal}_h(\sigma)$. As, by definition of the antagonistic value, we have $\mathrm{aVal}_h(\sigma') \le \mathrm{aVal}_h$, and $\mathrm{aVal}_h(\sigma) = \mathrm{aVal}_h$, we obtain $\mathrm{aVal}_h(\sigma') = \mathrm{aVal}_h$. We now know that $\sigma'$ is worst-case optimal at $h$. By definition of $\mathrm{acVal}_h$, we can conclude that $\mathrm{cVal}_h(\sigma') \le \mathrm{acVal}_h$. Since it is also true that $\mathrm{cVal}_h(\sigma') \ge \mathrm{acVal}_h$, we obtain $\mathrm{aVal}_h(\sigma') = \mathrm{cVal}_h(\sigma') = \mathrm{aVal}_h = \mathrm{acVal}_h$, that is, $\sigma'$ satisfies the second part of the characterization. ⊓⊔

E On the well-formedness of discounted-sum games
In [6], the authors introduce the notion of well-formed games, that is, games where, for each player and each history, there exist strategies witnessing the antagonistic and collaborative values at this history. They then show that, in such games, admissible strategies can be characterized in terms of values at any history consistent with the strategy (see Lemma 3). It is worth noticing that, for any player, it is in fact sufficient that this player has witnessing strategies for the antagonistic and collaborative values at any history. We call this property well-formedness for a player. In our context, we focus on the strategies and payoffs of Eve, thus we phrase the statement as follows: a game is well-formed for Eve if, for all $h \in \mathrm{Hist}$, Eve has strategies witnessing the antagonistic and collaborative values at $h$. Note that well-formedness for Eve, in general, does not guarantee the existence of a play that witnesses the collaborative value at any history. However, in discounted-sum games, this is indeed the case. In the proof of Theorem 4, we use the fact that there exists a play consistent with $\sigma_{\mathrm{sbo}}$ that has such value, thus also a strategy $\tau$ of Adam such that the outcome of $\sigma_{\mathrm{sbo}}$ and $\tau$ is exactly this play. The argument relies on the fact that discounted-sum value functions are continuous. We recall a few useful notions before proving the property.
Consider a discounted-sum game $G = (V, v_0, V_\exists, E, w, \lambda)$. The set $V$ is endowed with the discrete topology, and thus the set $V^\omega$ with the product topology. A sequence of plays $(\pi_n)_{n\in\mathbb{N}}$ is then said to converge to a play $\pi = \lim_{n\to\infty} \pi_n$ if every prefix of $\pi$ is a prefix of all but finitely many of the $\pi_n$. It is well known that the discounted-sum value function is continuous, that is, whenever a sequence of plays $(\pi_n)_{n\in\mathbb{N}}$ converges to a play $\pi$, we have $\lim_{n\to\infty} \mathrm{Val}(\pi_n) = \mathrm{Val}(\pi)$.
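The continuity claim can be illustrated numerically: if a play agrees with $\pi$ on its first $n$ edges, the discounted values differ by at most $2\lambda^n W / (1 - \lambda)$, where $W$ bounds the absolute weights. The sketch below checks this tail bound on a finite stand-in for a play, with assumed weights and discount factor.

```python
from fractions import Fraction

lam = Fraction(1, 2)  # assumed discount factor
W = 3                 # assumed bound on absolute edge weights

def val(weights, lam):
    """Discounted-sum value of a finite weight sequence."""
    return sum(lam**i * w for i, w in enumerate(weights))

pi = [3, -2, 1, 3, 0, 2, -1, 3]  # finite stand-in for a play, weights in [-W, W]
for n in range(len(pi)):
    # pi_n agrees with pi on the first n edges and diverges afterwards:
    pi_n = pi[:n] + [-w for w in pi[n:]]
    # Geometric tail bound: |Val(pi) - Val(pi_n)| <= 2 * lam^n * W / (1 - lam)
    assert abs(val(pi, lam) - val(pi_n, lam)) <= 2 * lam**n * W / (1 - lam)
```

As $n$ grows the bound $2\lambda^n W/(1-\lambda)$ vanishes, which is exactly why prefix convergence of plays forces convergence of their values.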
Since $d = \sup_\tau \mathrm{Val}(\mathrm{out}_{v_n}(\sigma_{\mathrm{sbo}}, \tau))$, we know that there exists a sequence $(\pi_n)_{n\in\mathbb{N}}$ of outcomes consistent with $\sigma_{\mathrm{sbo}}$ such that $\limsup_{n\to\infty} \mathrm{Val}(\pi_n) = d$; by passing to a subsequence, we may assume that $\lim_{n\to\infty} \mathrm{Val}(\pi_n) = d$.
Suppose the sequence $(\pi_n)_{n\in\mathbb{N}}$ eventually stabilizes, that is, there exists $N$ such that $\pi_n = \pi_N$ for every $n > N$. We have that $\lim_{n\to\infty} \mathrm{Val}(\pi_n) = \mathrm{Val}(\pi_N)$, thus $\mathrm{Val}(\pi_N) = d$. Let $\tau'$ be a strategy of Adam such that $\mathrm{out}_{v_n}(\sigma_{\mathrm{sbo}}, \tau') = \pi_N$ (which exists since $\pi_N$ is a valid outcome in $G$). We indeed obtain $\mathrm{Val}(\mathrm{out}_{v_n}(\sigma_{\mathrm{sbo}}, \tau')) = d$.
Suppose now the sequence $(\pi_n)_{n\in\mathbb{N}}$ does not eventually stabilize, that is, for all $n$ there exists $N > n$ such that $\pi_N \ne \pi_n$. We construct, iteratively, a subsequence $(\pi'_k)_{k\in\mathbb{N}}$ of $(\pi_n)_{n\in\mathbb{N}}$ as follows. We start by fixing $\pi'_0 = \pi_0$. Recall that $V$ is a finite set; let $m$ be its size, so that we can label its vertices $u_0, \dots, u_{m-1}$.
We partition the set of all $\pi_n$ according to their prefixes of length 2: for every $0 \le i < m$, we define $P^i_1$ as the set of all $\pi_n$ whose prefix of length 2 ends in $u_i$. As $V$ is finite and the set of outcomes in the sequence is infinite, there exists $i$ such that the set $P^i_1$ is infinite as well. We fix $P_1$ to be such an infinite $P^i_1$, pick $\pi \in P_1$, and fix $\pi'_1 = \pi$. Suppose now that $\pi'_0$ to $\pi'_k$ are already determined, as well as the infinite sets $P_1$ to $P_k$, and that $v_0 \dots v'_k$ is the prefix of length $k+1$ of $\pi'_k$. For every $0 \le i < m$, we define $P^i_{k+1}$ as the set of all plays in $P_k$ whose prefix of length $k+2$ extends $v_0 \dots v'_k$ with $u_i$. Again, as $V$ is finite and the set $P_k$ is infinite, there exists $i$ such that the set $P^i_{k+1}$ is infinite as well. Let $P_{k+1} \stackrel{\text{def}}{=} P^i_{k+1}$, pick $\pi \in P_{k+1}$, and fix $\pi'_{k+1} = \pi$. The subsequence $(\pi'_k)_{k\in\mathbb{N}}$ is now well defined and has the following property: for each $N \in \mathbb{N}$, each prefix $h_N$ of length $N+1$ of $\pi'_N$, and each $k \ge N$, we have $h_N \subseteq_{\mathrm{pref}} \pi'_k$. Let $\pi$ be the outcome such that $h_N \subseteq_{\mathrm{pref}} \pi$ for all $N \in \mathbb{N}$ (this outcome is well defined as $h_N \subseteq_{\mathrm{pref}} h_{N+1}$ for all $N \in \mathbb{N}$). By construction of $\pi$, we have that $\lim_{k\to\infty} \pi'_k = \pi$. As $(\pi'_k)_{k\in\mathbb{N}}$ is a subsequence of $(\pi_n)_{n\in\mathbb{N}}$, the sequence $(\mathrm{Val}(\pi'_k))_{k\in\mathbb{N}}$ is a subsequence of $(\mathrm{Val}(\pi_n))_{n\in\mathbb{N}}$, hence $\lim_{k\to\infty} \mathrm{Val}(\pi'_k) = d$. Since the discounted-sum value function is continuous, we also have $\mathrm{Val}(\pi) = \lim_{k\to\infty} \mathrm{Val}(\pi'_k)$. Thus we get $\mathrm{Val}(\pi) = d$. Let $\tau'$ be a strategy of Adam such that $\mathrm{out}_{v_n}(\sigma_{\mathrm{sbo}}, \tau') = \pi$ (which exists since $\pi$ is a valid outcome in $G$). We indeed obtain $\mathrm{Val}(\mathrm{out}_{v_n}(\sigma_{\mathrm{sbo}}, \tau')) = d$. ⊓⊔

F Proof of Lemma 4
Proof (of Lemma 4). We will suppose that neither inequality holds and derive a contradiction. Let $k$, $\ell$, and $m$ be the lengths of $\alpha$, $\beta$, and $\gamma$, respectively. On the one hand, we have that $\mathrm{Val}(\alpha^2 \cdot \beta) > \mathrm{Val}(P)$. This is equivalent to the following.
On the other hand, we have that Val(β · γ 2 ) > Val(P ). The latter holds if and only if the following does.
The last inequality is already in clear contradiction with Inequality 3. ⊓ ⊔

G Representing and comparing long-history values
Presently, we provide a brief discussion of succinctly encoded (rational) numbers. In this work we have assumed that all weights labeling edges in our game are given as binary-encoded numbers. We also assume the discount factor $\lambda$ is given in binary, that is, as a pair of binary-encoded natural numbers $p, q \in \mathbb{N}$ such that $q > 0$ and $p/q = \lambda$. In Section 6 we deal with numbers that seemingly do not admit such classical representations.
Besides encoding a number in binary, one can also consider polynomials (together with a binary-encoded valuation of their variables) or arithmetic circuits as representations for numbers (see, e.g., [4]). A number $P(a, e) = a_1^{e_1} + \cdots + a_n^{e_n}$, where $a_i \in \mathbb{Z}$ and $e_i \in \mathbb{N}$ for all $1 \le i \le n$, for instance, may be such that $P(a, e) \ge 2^{2^n}$ while being representable by a list of binary-encoded numbers using at most $n^2$ bits. An arithmetic circuit is an even more succinct representation. Formally, such a circuit is a rooted directed acyclic graph whose internal nodes are labelled with operations from $\{+, -, \times\}$ and whose leaves are labelled with binary-encoded integers. Determining whether a number given as an arithmetic circuit is positive is known as the PosSLP problem and has been shown to be decidable in the fourth level of the counting hierarchy by Allender et al. [1].
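To make the succinctness gap concrete: a circuit with $n$ multiplication nodes already denotes $2^{2^n}$ via repeated squaring. The sketch below is a minimal evaluator for circuits given as nested tuples (an illustrative encoding, not a standard one); in a true DAG the squaring node would be shared rather than copied, which is what keeps the representation small.

```python
def eval_circuit(node):
    """Evaluate a circuit given as an int leaf or a tuple
    ('+' | '-' | '*', left, right) over binary-encoded integer leaves."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = eval_circuit(left), eval_circuit(right)
    if op == '+':
        return a + b
    if op == '-':
        return a - b
    return a * b

# n = 4 squaring nodes: c_{k+1} = c_k * c_k starting from the leaf 2,
# so the root denotes 2^(2^4) with only 4 internal nodes.
c = 2
for _ in range(4):
    c = ('*', c, c)
assert eval_circuit(c) == 2 ** (2 ** 4)
```

Deciding positivity of such a value without expanding it is precisely what makes PosSLP nontrivial.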
In this work, because of the discount factor, when writing formulas for the discounted-sum value of long histories, we may in fact need to use division: to determine whether the value of a history is positive, one may have to compare quotients of such circuit-encoded numbers. In the context of this work, the main application of the arithmetic-circuit encoding discussed here is to express the discounted-sum value of a long history $\alpha \cdot \beta^k \cdot \gamma$ as follows:
$$\mathrm{Val}(\alpha \cdot \beta^k \cdot \gamma) = \mathrm{Val}(\alpha) + \lambda^{|\alpha|} \cdot \frac{\mathrm{Val}(\beta)}{1 - \lambda^{|\beta|}} \cdot \big(1 - \lambda^{|\beta| k}\big) + \lambda^{|\alpha| + |\beta| k} \cdot \mathrm{Val}(\gamma).$$
Note that while $\mathrm{Val}(\alpha)$, $\mathrm{Val}(\beta)$, and $\mathrm{Val}(\gamma)$ can be represented using binary rationals, this is not the case for $\lambda^{|\beta| k}$ in general.
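The closed form above can be sanity-checked against a direct evaluation of the unfolded history. The sketch below uses exact rational arithmetic with an assumed discount factor and arbitrary sample weights for $\alpha$, $\beta$, and $\gamma$.

```python
from fractions import Fraction

lam = Fraction(1, 3)  # assumed discount factor

def val(weights, lam):
    """Discounted-sum value of a finite weight sequence."""
    return sum(lam**i * w for i, w in enumerate(weights))

alpha, beta, gamma = [1, -2], [3, 0, 5], [-1]  # sample weight sequences
k = 7

# Direct evaluation of the unfolded history alpha . beta^k . gamma:
direct = val(alpha + beta * k + gamma, lam)

# Closed form: Val(alpha) + lam^|alpha| * Val(beta)/(1 - lam^|beta|) * (1 - lam^(|beta|k))
#              + lam^(|alpha| + |beta|k) * Val(gamma)
closed = (val(alpha, lam)
          + lam**len(alpha) * val(beta, lam) / (1 - lam**len(beta))
            * (1 - lam**(len(beta) * k))
          + lam**(len(alpha) + len(beta) * k) * val(gamma, lam))

assert direct == closed
```

The agreement is exact (not approximate) because `Fraction` performs the geometric-series summation $\sum_{j=0}^{k-1} \lambda^{|\beta| j} = (1 - \lambda^{|\beta| k})/(1 - \lambda^{|\beta|})$ without rounding.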

H Upper-bounding t(v) for optipess strategies
We will now prove the following bound on the finite values of the threshold function for optipess strategies.

Lemma 15. For all optipess strategies $\sigma$ of Eve with threshold function $t$, the value $t(v)$ is at most exponential in the size of the game for all $v \in V_\exists$ with $t(v) \in \mathbb{N}$.
Let us fix a game of size $|G|$ with discount factor $\lambda = p/q$, and define the constants used in the remainder of this section.

H.1 A lower bound on the regret of a game.
In [13] the following lower bound on the regret of games with non-zero regret was given.
Lemma 16 (From [13, Corollary 12]). For all $v_0 \in V$, if $\mathrm{Reg} > 0$, then $\mathrm{Reg}$ is bounded from below as follows.

Using the existence of positional optimal strategies in discounted-sum games (see Lemma 1), it is straightforward to show that the antagonistic and collaborative values are always realized by a simple lasso, that is, a play $\alpha \cdot \gamma^\omega$ where $\alpha$ is a simple path and $\gamma$ is a simple cycle. It follows that both values are representable using binary-number pairs that use polynomially many bits.
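Since simple lassos $\alpha \cdot \gamma^\omega$ realize these values, each value is the rational $\mathrm{Val}(\alpha) + \lambda^{|\alpha|} \cdot \mathrm{Val}(\gamma) / (1 - \lambda^{|\gamma|})$, which indeed has a short binary representation. The sketch below, with assumed sample weights and discount factor, computes this closed form exactly and checks that long finite unfoldings converge to it.

```python
from fractions import Fraction

lam = Fraction(1, 2)  # assumed discount factor p/q
W = 3                 # assumed bound on absolute edge weights

def val(weights, lam):
    """Discounted-sum value of a finite weight sequence."""
    return sum(lam**i * w for i, w in enumerate(weights))

alpha, gamma = [4, -1], [2, 0, -3]  # sample simple path and simple cycle weights

# Exact rational value of the infinite play alpha . gamma^omega:
lasso = val(alpha, lam) + lam**len(alpha) * val(gamma, lam) / (1 - lam**len(gamma))

# Finite unfoldings approach the exact value, with a geometric tail bound:
for k in (10, 20):
    approx = val(alpha + gamma * k, lam)
    assert abs(lasso - approx) <= lam**(len(alpha) + len(gamma) * k) * W / (1 - lam)
```

The numerator and denominator of `lasso` stay polynomial in the bit-sizes of the weights and of $p/q$, which is the representability claim above.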
As an immediate consequence of the above lemmas, we get the following lower bound on non-zero regret values.

Proposition 1. There exists a polynomial $P$ such that for all $v_0 \in V$, if $\mathrm{Reg} > 0$ then $\mathrm{Reg} \ge 2^{-P(|G|)}$.

H.2 An upper bound on the finite thresholds of an optipess strategy
We first note that, for all $v \in V_\exists$, if $\mathrm{cVal}_v = \mathrm{aVal}_v$ then $t(v) = 0$, and if $\mathrm{Reg} = 0$ then $t(v) = \infty$. Hence, it suffices to bound the threshold function for all $v \in V_\exists$ such that $\mathrm{cVal}_v > \mathrm{aVal}_v$ when $\mathrm{Reg} > 0$. In the sequel, we focus on an arbitrary vertex $v \in V_\exists$ and assume that those two inequalities hold.
