Optimality, Equilibrium, and Curb Sets in Decision Problems Without Commitment

The paper considers a class of decision problems with infinite time horizon that contains Markov decision problems as an important special case. Our interest concerns the case where the decision maker cannot commit himself to his future action choices. We model the decision maker as consisting of multiple selves, where each history of the decision problem corresponds to one self. Each self is assumed to have the same utility function as the decision maker. We introduce the notions of Nash equilibrium, subgame perfect equilibrium, and curb sets for decision problems. An optimal policy at the initial history is a Nash equilibrium but not vice versa. Both subgame perfect equilibria and curb sets are equivalent to subgame optimal policies. The concept of a subgame optimal policy is therefore robust to the absence of commitment technologies.


Introduction
In this paper, we study a class of decision problems with infinite time horizon that contains discounted Markov decision problems with a finite set of states and actions as an important subclass. In every time period, nature selects a state. We take the perspective of a decision maker that is informed about the state and has to take an action out of a set of actions, thereby obtaining an instantaneous payoff and generating a, potentially probabilistic, transition to a new state. This process is repeated indefinitely. Contrary to Markov decision problems, we allow for history-dependent sets of available actions and history-dependent state transitions.
Our main interest is in the case where the decision maker cannot commit himself to his future actions. He is therefore modeled as consisting of multiple selves that have utility functions identical to the one of the decision maker. Our emphasis is on the characterization of policies that are consistent with Nash equilibrium and subgame perfect equilibrium, and sets of policies that are closed under rational behavior.
To obtain a benchmark, we start our analysis by considering a decision maker that can commit himself to his future action choices. A policy of such a decision maker specifies a profile of history-contingent action choices that are all feasible at the corresponding history. A policy is optimal at a history if it maximizes the utility of the decision maker conditional on reaching that history. A policy is subgame optimal if it maximizes the utility of the decision maker at every history. We show the existence of subgame optimal policies and show that the set of subgame optimal policies has a product structure. Moreover, we characterize subgame optimal policies as a particular product of pure subgame optimal policies. These results are completely in line with those derived for Markov decision problems, see for instance the excellent overview by Puterman (1994).
We continue our analysis by assuming that the decision maker cannot commit himself to his future choices. He is fully aware of the actions and the payoff consequences of his future actions, but cannot commit himself to any future choice. There are many examples where the inability to commit to one's future actions has drastic consequences. Famous examples are given in the area of monetary policy by Kydland and Prescott (1977), where lack of commitment leads to socially suboptimal decision making, and in durable goods monopoly by Gul, Sonnenschein, and Wilson (1986), where the inability of the seller to commit to future prices exerts a negative externality on its current decisions and reduces its profits.
The decision maker can only rely on the fact that his future self will make an optimal choice. To study this case, we represent a decision problem as a stochastic game with an infinite number of players. Each history of the decision problem is represented by one player, who corresponds to one particular self of the decision maker. The utility functions of the players are all assumed to be identical to each other and equal to the one of the decision maker.[1] A strategy profile as chosen by the players is in a one-to-one correspondence with a policy in the decision problem.
The standard way to solve a game is the concept of Nash equilibrium as proposed by Nash (1950). At a Nash equilibrium of the decision problem, there is no self of the decision maker who can benefit from taking another action, given that all other selves stick to their actions. This approach to multiple-selves problems corresponds to the one suggested by Peleg and Yaari (1973) in the context of consumption choice with time-inconsistent preferences. We show that an optimal policy at the initial history is a Nash equilibrium, but not vice versa, so a Nash equilibrium may fail to be an optimal policy at the initial history. At a Nash equilibrium of the decision problem, the different selves may fail to coordinate in a satisfactory way, leading to suboptimal behavior.
A well-known problem with the concept of Nash equilibrium is that it does not require conditionally optimal behavior by players in subgames that are reached with probability zero. We continue the analysis by considering the concept of subgame perfect equilibrium as introduced in Selten (1965). We identify the subgames of a decision problem as decision problems conditional on reaching a particular history. A subgame perfect equilibrium of a decision problem is a strategy profile that is a Nash equilibrium of every decision problem that corresponds to a subgame. This approach to multiple-selves problems corresponds to the one suggested by Goldman (1980) in the context of consumption choice with time-inconsistent preferences. We show that the set of subgame perfect equilibria of a decision problem is equal to the set of subgame optimal policies.
A Nash equilibrium requires only that deviations are not profitable. So even the concept of subgame perfect equilibrium, requiring that the strategy profile is a Nash equilibrium in every subgame, does not require that unilateral deviations actually involve a loss, which is required by the more demanding notion of strict equilibrium. Although a strict equilibrium, and a fortiori a strict subgame perfect equilibrium, is more convincing as a stable strategy profile, it is not guaranteed to exist in decision problems. We therefore turn to a set-valued version of strict equilibrium as proposed by Basu and Weibull (1991) for games in normal form with a finite number of players. Basu and Weibull (1991) define a set of strategy profiles to be closed under rational behavior (curb) if it contains all its best responses. A minimal curb set is a curb set that does not contain any other curb set as a proper subset. Pruzhansky (2003) proposes a slight variation on this notion for extensive games with perfect information and a finite horizon.
[1] Even if the decision maker is modeled as consisting of multiple selves, there is therefore no issue of time-inconsistent preferences as introduced in Strotz (1956) and Pollak (1968). For an overview of that stream of the literature, we refer the reader to Frederick, Loewenstein, and O'Donoghue (2002). Myerson and Weibull (2015) define tenable strategy blocks, leading to a refinement of curb sets.
We define a minimal curb set for a decision problem by requiring that every self of the decision maker has a best response in the curb set conditional on his history being reached. A curb set therefore captures the situation where every self of the decision maker chooses an action that is a best response to some belief over action choices that are best responses for the future selves conditional on the history of the current self being reached. Voorneveld, Kets, and Norde (2005) point towards an important advantage of minimal curb sets. Contrary to point-valued concepts as studied in the equilibrium selection literature, a minimal curb set satisfies the axiom of consistency, a notion that has been introduced by  and Peleg, Potters and Tijs (1996). Consistency requires that if a set of players plays the game according to a particular solution, then the remaining players in the reduced game should not have an incentive to deviate from it. Using consistency, Voorneveld, Kets, and Norde (2005) provide an axiomatization of minimal curb sets.
Another advantage of minimal curb sets is that they are robust in a dynamic sense. Hurkens (1995) studies a stochastic version of fictitious play in the spirit of Young (1993) and shows that such a dynamic process of strategy adjustment will eventually settle down in a minimal curb set. Similarly, Young (1998) presents a fictitious play process with independent beliefs such that the stochastically stable states of the process correspond to the minimal curb sets minimizing the stochastic potential; see also Durieu, Solal, and Tercieux (2011) for the analysis of a more general class of fictitious play processes. Balkenborg, Hofbauer, and Kuzmics (2013) show how generalized best reply dynamics settle down within a minimal curb set based on the refined best reply correspondence. Further results on the connection between learning dynamics and minimal curb sets can be found in Kah and Walzl (2015).
A curb set is said to be tight if it is exactly equal to its set of best responses. Since a strict equilibrium corresponds to a singleton tight curb set, a tight curb set is indeed the appropriate set-valued generalization of a strict equilibrium. We show that a minimal curb set of a decision problem is tight. We also prove that a minimal curb set always exists, is unique, and coincides with the set of pure subgame optimal policies.
The rest of the paper is organized as follows. In Section 2, we define a class of decision problems that contains Markov decision problems as a special case. Section 3 provides the basic definitions related to policies and gives a characterization of subgame optimal policies. We introduce multiple selves in Section 4, define and study the concepts of Nash equilibrium and subgame perfect equilibrium of a decision problem and show the set of subgame perfect equilibria to coincide with the set of subgame optimal policies. Section 5 introduces the notion of a minimal curb set of a decision problem and shows that it coincides with the set of pure subgame optimal policies. Section 6 concludes.

Decision Problems
A decision problem is described by the tuple $D = (S, A, H, \pi, f)$. Moves are made by nature and the decision maker in an alternating fashion, where the decision maker chooses actions in the non-empty, finite set of actions $A$ and nature picks states in the non-empty, finite set of states $S$. The payoff to the decision maker depends on the entire sequence of states and actions thus produced. The class of decision problems that we study contains the class of discounted Markov decision problems as a subclass. The element $s_0$ is a distinguished element of $S$ called the initial state.
We let $\mathbb{N} = \{0, 1, \ldots\}$ denote the set of natural numbers with 0. Each element $h$ of the set $H$ is a finite sequence of the form $h = (s_0, a_0, \ldots, s_{\ell-1}, a_{\ell-1}, s_\ell)$, where $\ell$ is a natural number, $s_0, \ldots, s_\ell$ are elements of $S$, and $a_0, \ldots, a_{\ell-1}$ are elements of $A$. Elements of $H$ are called histories.
Given a history $h = (s_0, a_0, \ldots, s_{\ell-1}, a_{\ell-1}, s_\ell)$ in $H$, we denote its length by $\ell(h)$. Moreover, for $k = 0, \ldots, \ell(h) - 1$, we denote the state in period $k$ by $s_k(h)$ and the action taken in period $k$ by $a_k(h)$, and we refer to $s_{\ell(h)}(h)$ as the current state.
Consider histories $h$ and $h'$ in $H$, where $h' = (s_0, a_0, \ldots, s_{\ell-1}, a_{\ell-1}, s_\ell)$. The history $h$ is said to be a subhistory of $h'$ if $h = (s_0, a_0, \ldots, s_{k-1}, a_{k-1}, s_k)$ for some $k \le \ell$. It is said to be a proper subhistory of $h'$ if $k < \ell$. We write $h \le h'$ to denote that $h$ is a subhistory of $h'$, and $h < h'$ to denote that $h$ is a proper subhistory of $h'$. The unique subhistory of $h$ of length $k$ is denoted by $h_k$.
The set of actions available at a history $h \in H$ is denoted by $A_h$, so $A_h = \{a \in A : (h, a, s) \in H \text{ for some } s \in S\}$. It is convenient to define the set $G$ of nature histories, i.e. histories after which nature selects the next state, as $G = \{(h, a) : h \in H,\ a \in A_h\}$. We extend the notions of subhistories and length to nature histories in the straightforward way.
The set of states that may be reached at $g \in G$ is denoted by $S_g$, so $S_g = \{s \in S : (g, s) \in H\}$. The set of histories $H$ is assumed to have the following properties:
1. For every $h \in H$, $A_h \neq \emptyset$.
2. For every $g \in G$, $S_g \neq \emptyset$.
3. For every $h \in H$, each subhistory of $h$ is an element of $H$.
We do not impose any restriction on the set of available actions at a history h ∈ H, apart from the existence of at least one such action. Similarly, we do not impose any restriction on the set of possible states at a nature history g, apart from the existence of at least one such state.
The function $\pi$ is a law of transition that assigns to each nature history $g \in G$ a probability distribution on the set $S_g$ and thereby specifies the transition probabilities. We let $\pi(s \mid g) \ge 0$ denote the probability that the system jumps from nature history $g \in G$ to state $s \in S_g$. Obviously, it holds that $\sum_{s \in S_g} \pi(s \mid g) = 1$.
Consider an infinite sequence $p = (s_0, a_0, s_1, a_1, \ldots)$. The sequence $p$ is said to be a play if the finite sequence $(s_0, a_0, \ldots, s_k)$ is an element of $H$ for every $k \in \mathbb{N}$. We let $P$ be the set of plays. We endow $P$ with the topology generated by the basis of cylinder sets.
The payoff function f : P → R assigns payoffs to plays. Throughout this paper we assume that the function f is continuous.
A decision problem proceeds as follows. At stage 0 the state is given to be $s_0$. We let $h_0 = (s_0)$ be the first decision history encountered by the decision maker. The decision maker then chooses an action $a_0 \in A_{h_0}$ and the transition to the next state $s_1$ occurs with probability $\pi(s_1 \mid h_0, a_0)$. We let $h_1 = (s_0, a_0, s_1)$. The decision maker then chooses an action $a_1 \in A_{h_1}$, and the transition to state $s_2$ occurs with probability $\pi(s_2 \mid h_1, a_1)$. This process continues ad infinitum. Nature and the decision maker thus produce a play $p = (s_0, a_0, s_1, a_1, s_2, \ldots)$. The decision maker receives the payoff $f(p)$.
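The alternating moves described above can be sketched in code. The following is a minimal illustration, not part of the formal model: the function name `simulate_play` and its arguments are hypothetical, histories are represented as tuples, and policies and transitions are passed as functions returning dictionaries of probabilities. It produces only a finite prefix of a play.

```python
import random

def simulate_play(s0, actions, policy, transition, periods, rng):
    """Return the history (s0, a0, s1, ..., s_periods) reached after `periods` steps."""
    history = (s0,)
    for _ in range(periods):
        dist = policy(history)                    # dict: action -> probability
        # Every action in the support must be available at this history.
        assert set(dist) <= set(actions(history))
        a = rng.choices(list(dist), weights=list(dist.values()))[0]
        g = history + (a,)                        # nature history (h, a)
        nxt = transition(g)                       # dict: state -> pi(s | g)
        s = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        history = g + (s,)
    return history

# Hypothetical one-state example: both actions are always available and
# the policy plays "b" with probability 1 at every history.
prefix = simulate_play(
    s0="s",
    actions=lambda h: {"a", "b"},
    policy=lambda h: {"b": 1.0},
    transition=lambda g: {"s": 1.0},
    periods=2,
    rng=random.Random(0),
)
```

Because both the policy and the transition law receive the full history, the sketch accommodates history-dependent action sets and transitions, which is the feature that distinguishes the present class from Markov decision problems.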
An important subclass of decision problems consists of Markov decision problems. The decision problem is said to be a discounted Markov decision problem if [1] the set of available actions at a history depends only on the current state, [2] the transition probabilities depend only on the current state, and [3] there is a function $u : S \times A \to \mathbb{R}$, called the instantaneous payoff function, and a number $\delta \in (0, 1)$, called the discount factor, such that for any play $p = (s_0, a_0, s_1, a_1, \ldots)$ we have
$$f(p) = \sum_{t=0}^{\infty} \delta^t u(s_t, a_t).$$
Given a decision problem $D$ as above and a history $h = (s_0, a_0, \ldots, s_{t-1}, a_{t-1}, s_t)$ in $H$, we introduce a conditional decision problem $D_h$ to be the problem that the decision maker faces once the history $h$ has occurred. The idea of such conditional decision problems is similar to the idea of a subgame in game theory. In the decision problem $D_h$ the initial state is $s_t$. The set of histories in $D_h$, denoted $H_h$, is the set of sequences $h' = (s'_0, a'_0, \ldots, s'_{\ell-1}, a'_{\ell-1}, s'_\ell)$ such that $s'_0 = s_t$ and the sequence $(h, h') = (s_0, a_0, \ldots, s_t, a'_0, \ldots, s'_{\ell-1}, a'_{\ell-1}, s'_\ell)$ is an element of $H$. We let $P_h$ denote the set of plays of $D_h$. Transition probabilities in $D_h$ are defined in the obvious way and the payoff function is given by $f_h(p) = f(h, p)$ for $p = (s'_0, a'_0, s'_1, a'_1, \ldots) \in P_h$, where $(h, p)$ denotes the infinite sequence $(s_0, a_0, \ldots, s_t, a'_0, s'_1, a'_1, \ldots)$. In particular, it holds that $D_{s_0} = D$.
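Returning to the discounted Markov case, the continuity assumption on $f$ is met there. The following bound is a short check, assuming the standard discounted form $f(p) = \sum_{t} \delta^t u(s_t, a_t)$ and writing $M = \max_{(s,a) \in S \times A} |u(s, a)|$, a symbol introduced here only for illustration; $M$ is finite because $S$ and $A$ are finite. If two plays $p = (s_0, a_0, \ldots)$ and $p' = (s'_0, a'_0, \ldots)$ coincide in periods $0, \ldots, k-1$, then

```latex
\begin{align*}
|f(p) - f(p')|
  &= \Bigl|\sum_{t=k}^{\infty} \delta^t \bigl(u(s_t, a_t) - u(s'_t, a'_t)\bigr)\Bigr| \\
  &\le \sum_{t=k}^{\infty} \delta^t \cdot 2M
   = \frac{2M\,\delta^k}{1-\delta}.
\end{align*}
```

The bound vanishes as $k$ grows, so plays that share a long common prefix have nearly equal payoffs, which is continuity with respect to the topology generated by the cylinder sets.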

Policies
In this section, we define various notions related to policies and characterize optimal policies.
A policy is a function $\sigma$ assigning to each history $h \in H$ a probability distribution $\sigma(h)$ on the set $A_h$. The set of policies is denoted by $\Sigma$. A policy is said to be pure if for each $h \in H$ the distribution $\sigma(h)$ assigns probability 1 to some particular action in $A_h$. For each history $h \in H$, a policy $\sigma$ of the decision problem $D$ induces a policy $\sigma_h$ in the decision problem $D_h$.
We let $U(\sigma)$ denote the expected payoff of the policy $\sigma$. Formally, $U(\sigma)$ is the expected value of the payoff function $f$ with respect to the probability measure on $P$ generated by the policy $\sigma$ and the law of transition $\pi$. Similarly, we let $U_h(\sigma)$ denote the expected payoff of the policy $\sigma$ conditional on the history $h$ being reached. In particular, it holds that $U_{s_0}(\sigma) = U(\sigma)$. Equivalently, $U_h(\sigma)$ is equal to the expected payoff of the induced policy $\sigma_h$ in the decision problem $D_h$.
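We let $v_h$ denote the value of the decision problem $D_h$, that is, the highest expected payoff that the decision maker can achieve once the history $h$ has occurred, $v_h = \sup_{\sigma \in \Sigma} U_h(\sigma)$. A policy $\sigma$ is optimal at history $h$ if $U_h(\sigma) = v_h$, and it is subgame optimal if it is optimal at every history $h \in H$. Equivalently, $\sigma$ is subgame optimal if for every $h \in H$ the policy $\sigma_h$ is optimal at the initial history of the conditional decision problem $D_h$.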
A subgame optimal policy is clearly optimal at the initial history, but the reverse may not be true. If σ is optimal at the initial history, and if history h is reached with positive probability under σ, then σ is also optimal at h. In general, however, a policy that is optimal at the initial history need not be optimal at all histories, and hence it may fail to be subgame optimal.
The set of policies $\Sigma$ endowed with the product topology is compact by the Tychonoff product theorem. Since the payoff function $U_h$ is continuous on $\Sigma$, an optimal policy at history $h$ always exists. We next consider a characterization of subgame optimal policies in terms of one-day optimal actions, a result that is known in various guises in dynamic programming, see Blackwell (1965), and in stochastic games, see Shapley (1953).
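The values defined above satisfy the following recursive relations: for every $h \in H$,
$$v_h = \max_{a \in A_h} \sum_{s \in S_{(h,a)}} \pi(s \mid h, a)\, v_{(h,a,s)}. \qquad (3.1)$$
We let $O_h$ denote the set of one-day optimal actions at $h$, that is, the set of actions in $A_h$ that attain the maximum in (3.1). The theorem of this section states that a policy $\sigma$ is subgame optimal if and only if, for every $h \in H$, $\sigma(h)$ only assigns positive probability to the members of $O_h$.
To prove the if part of the theorem, consider a policy $\sigma$ such that $\sigma(h)$ only assigns positive probability to the members of $O_h$. For $t \in \mathbb{N}$, let $\sigma^t$ denote a policy such that [1] for every history $h$ with length smaller than $t$, $\sigma^t(h) = \sigma(h)$, and [2] for every history $h$ of length $t$, $\sigma^t$ is optimal at $h$. Thus in particular $\sigma^0$ is optimal at the initial history. Unraveling the relation (3.1), we obtain $U(\sigma^t) = v$, where $v = v_{s_0}$ denotes the value of $D$. Using the continuity of the function $U$, and the fact that $\sigma^t$ converges to $\sigma$ as $t \to \infty$, we obtain that $U(\sigma) = v$. A similar argument shows that $\sigma$ is optimal at each history $h$.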
It follows from the above theorem that a pure policy $\sigma$ is subgame optimal if and only if $\sigma(h) \in O_h$ for every $h \in H$. The set of pure subgame optimal policies can therefore be written as a cartesian product $\prod_{h \in H} O_h$, henceforth to be denoted by $O$. Similarly, the entire set of subgame optimal policies can be written as a cartesian product $\prod_{h \in H} \Delta(O_h)$, where $\Delta(O_h)$ denotes the set of probability distributions on $O_h$.
For a Markov decision problem, a policy $\sigma$ is optimal at the initial state $s_0$ if $U_{s_0}(\sigma) = v_{s_0}$. The policy $\sigma$ is said to be optimal in the Markov decision problem $D$ if it is optimal at each initial state, see e.g. Puterman (1994). A classical result states that a discounted Markov decision problem $D$ has a stationary optimal policy. A stationary optimal policy is also subgame optimal in the sense defined before. However, simple examples suffice to show that a history-dependent optimal policy need not be subgame optimal. Subgame optimality is therefore a natural strengthening of optimality.
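The recursive relation for the values and the sets of one-day optimal actions can be illustrated by backward induction. The sketch below is only an illustration under a simplifying assumption that the paper does not make: the payoff is taken to depend on the first `horizon` periods only, so that the recursion can start from a terminal payoff. The function `solve` and its arguments are hypothetical names; the example data describe a one-state problem in which the payoff is 1 exactly when action b is chosen in periods 0 and 1.

```python
def solve(h, horizon, actions, transition, terminal_payoff, values, optimal, tol=1e-12):
    """Backward induction: fill values[h] and optimal[h] for h and all its successors."""
    length = (len(h) - 1) // 2          # a history (s0, a0, ..., s_l) has length l
    if length == horizon:
        values[h] = terminal_payoff(h)
        return values[h]
    q = {}
    for a in actions(h):
        g = h + (a,)                    # nature history (h, a)
        # Expected continuation value over the states nature may select at g.
        q[a] = sum(prob * solve(g + (s,), horizon, actions, transition,
                                terminal_payoff, values, optimal, tol)
                   for s, prob in transition(g).items())
    values[h] = max(q.values())
    # One-day optimal actions: those attaining the maximum in the recursion.
    optimal[h] = {a for a, x in q.items() if x >= values[h] - tol}
    return values[h]

def payoff(h):
    """Terminal payoff for histories of length 2: 1 iff b is played twice."""
    return 1.0 if (h[1], h[3]) == ("b", "b") else 0.0

values, optimal = {}, {}
solve(("s",), 2, lambda h: ("a", "b"), lambda g: {"s": 1.0}, payoff, values, optimal)
```

In this example `optimal[("s",)]` is `{"b"}`, `optimal[("s", "b", "s")]` is `{"b"}`, and `optimal[("s", "a", "s")]` contains both actions, so the set of pure subgame optimal policies is the product of these sets.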

Multiple Selves
In this section, we take the perspective that the decision maker cannot commit himself to his future actions. To do so, we model the decision maker as consisting of multiple selves. This leads us to a game-theoretic model in which each history of the decision problem is associated with a self. While all selves have the same payoff function, identical to the one of the decision maker, they make their decisions independently. This potentially opens up a possibility for miscoordination. In this section we analyze the game using the concepts of both Nash and subgame perfect equilibrium.
Let $D$ be a decision problem as in Section 2. At each history, the current self of the decision maker is free to take any action. Every possible history $h \in H$ therefore leads to a self of the decision maker, also referred to as a player. A pure strategy of player $h$ is an element of $A_h$ and a mixed strategy is an element of $\Delta(A_h)$, the set of probability distributions on $A_h$. A profile $\sigma = (\sigma(h))_{h \in H}$ of mixed strategies describes a strategy choice for each player. Notice that as a mathematical object, a strategy profile is equivalent to a policy. The utility function of every player $h \in H$ is the same and is identical to the one of the decision maker at the initial history, $U$.
We start our discussion by applying the concept of Nash equilibrium to the decision problem $D$ viewed as a game played by multiple selves. A strategy profile is a Nash equilibrium of $D$ if no player can improve the payoff at the initial period by a unilateral deviation. More precisely, $\sigma \in \Sigma$ is a Nash equilibrium if for every $h \in H$ and for every $\eta(h) \in \Delta(A_h)$ it holds that $U(\sigma) \ge U(\sigma/\eta(h))$, where $\sigma/\eta(h)$ denotes the strategy profile obtained from $\sigma$ by replacing its coordinate $\sigma(h)$ by $\eta(h)$. If a policy $\sigma^* \in \Sigma$ is optimal at the initial history, then it is a Nash equilibrium of $D$. Indeed, recall that by definition a policy that is optimal at the initial history maximizes the payoff function $U$ over the entire set of policies. Thus in particular no unilateral deviation from $\sigma^*$ can improve the payoff.
Figure 1: A decision problem with a Nash equilibrium that is not optimal at the initial history.
The following example shows that the converse is not necessarily the case: a Nash equilibrium of D may fail to be an optimal policy at the initial history. Thus, under the concept of Nash equilibrium, multiple selves can severely fail to coordinate.
Example 4.1 Consider the decision problem depicted in Figure 1. Formally, this could be modeled as a decision problem $D$ where the set of states $S$ is a singleton, the set of actions is $A = \{a, b\}$, the set $H$ consists of all sequences $(s_0, a_0, \ldots, s_0)$ where $a_t \in A$ and $s_0$ is the only state, and $f(p)$ depends on the play $p = (s_0, a_0, s_0, a_1, \ldots)$ only through its coordinates $a_0$ and $a_1$ as follows: $f(p) = 0$ if $a_0 = a$ or if $a_0 = b$ and $a_1 = a$, and $f(p) = 1$ otherwise.
Obviously, playing $a_0 = a_1 = b$ is an optimal policy at the initial history. However, the decision problem $D$ has another Nash equilibrium: playing action $a$ at both periods 0 and 1. Indeed, a unilateral deviation by player $s_0$, the player active at the root of the tree, to action $b$ is not profitable, because player $(s_0, b, s_0)$ is assumed to stick with action $a$. The unilateral deviation to action $b$ by player $(s_0, b, s_0)$ is not profitable, because player $s_0$ plays $a$, so that player $(s_0, b, s_0)$ has no effect on the payoff. This second Nash equilibrium leads to a strictly lower payoff than an optimal policy at the initial history.
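The claims of the example can be checked by brute force. The sketch below is an illustration only: it reduces the example to the three selves whose choices can affect the payoff, with the hypothetical labels `root`, `after_a` and `after_b` standing for the histories $(s_0)$, $(s_0, a, s_0)$ and $(s_0, b, s_0)$, and it considers pure profiles and pure deviations. Pure deviations suffice here because the payoff is linear in each self's own mixed action.

```python
from itertools import product

ACTIONS = ("a", "b")
SELVES = ("root", "after_a", "after_b")   # histories (s0), (s0,a,s0), (s0,b,s0)

def payoff(profile):
    """U for a pure profile: 1 only if b is played at the root and again after b."""
    a0 = profile["root"]
    a1 = profile["after_a"] if a0 == "a" else profile["after_b"]
    return 1 if (a0 == "b" and a1 == "b") else 0

def is_nash(profile):
    """No self can raise the initial-history payoff U by a unilateral pure deviation."""
    base = payoff(profile)
    for self_ in SELVES:
        for dev in ACTIONS:
            deviated = {**profile, self_: dev}
            if payoff(deviated) > base:
                return False
    return True

profiles = [dict(zip(SELVES, choice)) for choice in product(ACTIONS, repeat=3)]
best = max(payoff(p) for p in profiles)
nash = [p for p in profiles if is_nash(p)]
optimal = [p for p in profiles if payoff(p) == best]
```

The enumeration finds four pure Nash equilibria, of which only the two with payoff 1 are optimal at the initial history; the other two have the root self playing a while the self after b also plays a.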
We presently turn to the concept of subgame perfect equilibrium. We argue that in a subgame perfect equilibrium, full coordination obtains. More precisely, we show that the set of subgame perfect equilibrium strategy profiles coincides with the set of subgame optimal policies.
A strategy profile is a subgame perfect equilibrium if it induces a Nash equilibrium in each subgame. Thus $\sigma \in \Sigma$ is a subgame perfect equilibrium if for every $h \in H$ the strategy profile $\sigma_h$ is a Nash equilibrium of $D_h$. Equivalently, $\sigma$ is a subgame perfect equilibrium of $D$ if for every $h \in H$ and every $\eta(h) \in \Delta(A_h)$ it holds that $U_h(\sigma) \ge U_h(\sigma/\eta(h))$.
Theorem 4.2 Let $D$ be a decision problem. A policy is subgame optimal if and only if it is a subgame perfect equilibrium of $D$.
Proof: Suppose a policy $\sigma$ is subgame optimal. Then for every history $h \in H$ the policy $\sigma$ is optimal at history $h$. Hence $\sigma_h$ is a Nash equilibrium of $D_h$. We conclude that $\sigma$ is a subgame perfect equilibrium of $D$.
Conversely, suppose a policy $\sigma$ is a subgame perfect equilibrium of $D$. Let $\eta$ be an arbitrary policy. For $t \in \mathbb{N}$, let $\sigma^t$ be the strategy profile defined as follows: let $\sigma^t(h)$ be equal to $\eta(h)$ for each history $h$ with length smaller than $t$ and be equal to $\sigma(h)$ for each history $h$ of length at least $t$. In particular, it holds that $\sigma^0 = \sigma$.
It is sufficient to show that, for every $h \in H$, for every $t \in \mathbb{N}$,
$$U_h(\sigma^t) \ge U_h(\sigma^{t+1}). \qquad (4.1)$$
Indeed, using the continuity of $U_h$ and the fact that $\sigma^t$ converges to $\eta$ as $t$ goes to infinity, one then concludes that $U_h(\sigma^0) \ge U_h(\eta)$, as desired. We continue by proving (4.1). Take some $t \in \mathbb{N}$. If $\ell(h) \ge t + 1$ then $\sigma^t_h = \sigma^{t+1}_h = \sigma_h$, so (4.1) holds with equality.
Consider a history $h$ of length $\ell(h) = t$. Since $\sigma_h$ is a Nash equilibrium in $D_h$, the player active at history $h$ does not profit from a unilateral deviation from $\sigma(h)$ to $\eta(h)$. Notice that such a deviation induces the strategy profile $\sigma^{t+1}_h$ in $D_h$. We conclude that (4.1) is satisfied.
Finally, we use induction to prove (4.1) for histories of length $0, \ldots, t$. We already know that (4.1) holds for all histories of length $t$. Suppose we have proven (4.1) for all histories of length $k + 1 \in \{1, \ldots, t\}$. Consider a history $h$ of length $\ell(h) = k$. Since $k < t$, we have $\sigma^t(h) = \sigma^{t+1}(h) = \eta(h)$. Writing $\eta(h)(a)$ for the probability that $\eta(h)$ assigns to action $a$, it holds that
$$U_h(\sigma^t) = \sum_{a \in A_h} \eta(h)(a) \sum_{s \in S_{(h,a)}} \pi(s \mid h, a)\, U_{(h,a,s)}(\sigma^t) \ge \sum_{a \in A_h} \eta(h)(a) \sum_{s \in S_{(h,a)}} \pi(s \mid h, a)\, U_{(h,a,s)}(\sigma^{t+1}) = U_h(\sigma^{t+1}),$$
where the inequality follows by the induction hypothesis. This completes the induction step and the proof of the theorem.
The following corollary follows immediately from the preceding theorem.
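The equivalence between subgame perfect equilibria and subgame optimal policies can be checked by enumeration in Example 4.1. The sketch below is an illustration restricted to pure profiles and to the three payoff-relevant selves, with the hypothetical labels `root`, `after_a` and `after_b` for the histories $(s_0)$, $(s_0, a, s_0)$ and $(s_0, b, s_0)$; the function `conditional_payoff` plays the role of $U_h$.

```python
from itertools import product

ACTIONS = ("a", "b")
SELVES = ("root", "after_a", "after_b")

def conditional_payoff(h, profile):
    """U_h for a pure profile in the two-period reduction of Example 4.1."""
    if h == "root":
        return 1 if (profile["root"], profile["after_b"]) == ("b", "b") else 0
    if h == "after_a":
        return 0                      # after a, the payoff is 0 whatever follows
    return 1 if profile["after_b"] == "b" else 0

def is_spe(profile):
    """At every history h, no deviation by self h raises the conditional payoff U_h."""
    return all(conditional_payoff(h, {**profile, h: dev}) <= conditional_payoff(h, profile)
               for h in SELVES for dev in ACTIONS)

profiles = [dict(zip(SELVES, c)) for c in product(ACTIONS, repeat=3)]
value = {h: max(conditional_payoff(h, p) for p in profiles) for h in SELVES}
spe = [p for p in profiles if is_spe(p)]
subgame_optimal = [p for p in profiles
                   if all(conditional_payoff(h, p) == value[h] for h in SELVES)]
```

Both lists consist of the two profiles in which b is played at the root and after b. The Nash equilibrium in which a is played throughout is excluded, because the self at $(s_0, b, s_0)$ gains by switching to b once the payoff is evaluated conditional on that history.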

Sets Closed Under Rational Behavior
Although we have derived an equivalence between the subgame optimal policies and the subgame perfect equilibria of a decision problem, one may still criticize the lack of stability of Nash equilibrium for the decision problems at the various histories. The problem is essentially that a Nash equilibrium only requires a deviation not to be profitable, rather than requiring that it actually involves a loss. The concept of strict equilibrium addresses this issue, but may fail to exist. For instance, in a decision problem where a player h ∈ H can choose between two distinct best responses, a strict equilibrium does not exist.
The following example is taken from Basu and Weibull (1991) to illustrate that indifferences of a player about what action to take can make a Nash equilibrium unstable.
Example 5.1 Consider the normal-form game depicted in Figure 2. This game has a unique Nash equilibrium where player 1 randomizes between T and M with probability 2/3 and 1/3, respectively, and player 2 randomizes between L and R with probability 1/4 and 3/4, respectively. Under the Nash equilibrium beliefs, player 1 is indifferent between T and M and player 2 is indifferent between L and R. If player 2 believes that player 1 is going to choose M with probability above 1/3, a belief that is quite natural given that player 1 is indifferent between T and M, then player 2's unique best response is L. And if player 1 believes it to be quite likely that player 2 chooses L, then his unique best response is B. It therefore is not justified to exclude B as a reasonable choice for player 1.
To address the instability of Nash equilibrium, Basu and Weibull (1991) consider minimal sets of strategy profiles that are closed under rational behavior (curb) for normal-form games with a finite number of players. In this section, we define curb sets for decision problems. We show that a minimal curb set is unique and equal to the set of pure subgame perfect equilibria, and therefore equal to the set of pure subgame optimal policies by virtue of Corollary 4.3.
Let $D$ be a decision problem as in Section 2. We say that a subset $X$ of $\prod_{h \in H} A_h$ is a product set if $X = \prod_{h \in H} X_h$ for some $X_h \subseteq A_h$. Let $B$ denote the collection of all non-empty product sets. For every $X \in B$, we define the subset $\Delta(X)$ of $\Sigma$ as the set of strategy profiles $\sigma$ such that for every $h \in H$ the support of $\sigma(h)$ is contained in $X_h$. Thus $\Delta(X)$ is of the following form,
$$\Delta(X) = \prod_{h \in H} \Delta(X_h).$$
We recall that the set of pure subgame optimal policies, denoted $O$, is a product set, with its factor $O_h$ being the set of one-day optimal actions at $h$, and that $\Delta(O)$ is the set of subgame optimal policies of $D$.
The set of pure best responses by player $h \in H$ at history $h$ against a strategy profile $\sigma \in \Sigma$ is defined by
$$b_h(\sigma) = \arg\max_{a(h) \in A_h} U_h(\sigma/a(h)). \qquad (5.1)$$
Note that the pure strategy profile $\sigma$ is a subgame perfect equilibrium of $D$ precisely when $\sigma(h) \in b_h(\sigma)$ for every $h \in H$. We proceed to define the function $\mu : B \to B$, called the curb operator for $D$, as follows: for every $X \in B$, let
$$\mu(X) = \prod_{h \in H} \mu_h(X), \quad \text{where} \quad \mu_h(X) = \bigcup_{\sigma \in \Delta(X)} b_h(\sigma).$$
Thus a pure policy $\eta$ is an element of $\mu(X)$ if for every player $h \in H$ there exists a policy $\sigma \in \Delta(X)$ such that $\eta(h)$ is player $h$'s best response to $\sigma$. Essential to this definition is the order of quantification. It reflects the fact that different players are allowed to hold different, and incompatible, beliefs about their future selves.
Definition 5.2 Let D be a decision problem. A set X ∈ B is closed under rational behavior (curb) if µ(X) ⊆ X. A curb set is minimal if it does not contain any curb set as a proper subset.
A set $X$ of pure strategy profiles is curb if, whenever all players believe that actions outside $X$ are played with probability 0, rational players will only play actions inside $X$. Since the curb criterion is met by the set $X = \prod_{h \in H} A_h$ of all pure strategy profiles, we are particularly interested in minimal curb sets.
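To make the curb operator concrete, the sketch below applies it repeatedly to Example 4.1, starting from the set of all pure strategy profiles. Two simplifications are assumptions of this illustration and not part of the definition: only the three payoff-relevant selves are modeled (hypothetical labels `root`, `after_a`, `after_b`), and beliefs are restricted to pure profiles in $X$ rather than all of $\Delta(X)$. In this particular example the restriction does not change the resulting sets, since the root self's best responses to any belief that puts positive probability on b after b coincide with its best response to the pure belief b. Iterating the operator is used here only as a way to display a curb set; it is not a procedure taken from the paper.

```python
from itertools import product

ACTIONS = ("a", "b")
SELVES = ("root", "after_a", "after_b")   # histories (s0), (s0,a,s0), (s0,b,s0)

def conditional_payoff(h, profile):
    """U_h for a pure profile in the two-period reduction of Example 4.1."""
    if h == "root":
        return 1 if (profile["root"], profile["after_b"]) == ("b", "b") else 0
    if h == "after_a":
        return 0
    return 1 if profile["after_b"] == "b" else 0

def best_responses(h, belief):
    """Pure best responses of self h, over all available actions, against a pure belief."""
    scores = {a: conditional_payoff(h, {**belief, h: a}) for a in ACTIONS}
    top = max(scores.values())
    return {a for a, x in scores.items() if x == top}

def curb_operator(X):
    """X maps each self to a non-empty action set; beliefs range over pure profiles in X."""
    result = {h: set() for h in SELVES}
    for choice in product(*(sorted(X[h]) for h in SELVES)):
        belief = dict(zip(SELVES, choice))
        for h in SELVES:
            result[h] |= best_responses(h, belief)
    return result

steps = [{h: set(ACTIONS) for h in SELVES}]
while True:
    nxt = curb_operator(steps[-1])
    if nxt == steps[-1]:
        break
    steps.append(nxt)
fixed_point = steps[-1]
```

Starting from the full set, the first application removes action a for the self after b, the second removes action a at the root, and the third changes nothing. The resulting product set leaves both actions to the self after a, who is indifferent, and coincides with the set of pure subgame optimal policies of the example.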
For normal-form games with a finite number of players, Basu and Weibull (1991) show that a minimal curb set always exists, though it may not be unique. The next result states that every decision problem also has at least one minimal curb set.
Theorem 5.3 Let D be a decision problem. Then D has at least one minimal curb set.
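Proof: Clearly, the set $\prod_{h \in H} A_h$ is a curb set. Let $\mathcal{C}$ be the collection of all curb sets. We define the partial order $\supseteq$ on $\mathcal{C}$ in the usual way, so for $X, X' \in \mathcal{C}$ it holds that $X \supseteq X'$ if and only if, for every $h \in H$, $X_h \supseteq X'_h$.
Let $\mathcal{D}$ be a subset of $\mathcal{C}$ that is totally ordered by $\supseteq$. We define $\bar{X} = \bigcap_{X \in \mathcal{D}} X$. Since $\mathcal{D}$ is totally ordered by $\supseteq$ and, for every $X \in \mathcal{D}$, for every $h \in H$, $X_h$ is finite, it follows that $\bar{X}$ is non-empty.
We show $\bar{X}$ to be a curb set. For every $X \in \mathcal{D}$, since $X \supseteq \bar{X}$, it holds that $\mu(X) \supseteq \mu(\bar{X})$. It follows that
$$\bar{X} = \bigcap_{X \in \mathcal{D}} X \supseteq \bigcap_{X \in \mathcal{D}} \mu(X) \supseteq \mu(\bar{X}),$$
where the first inclusion follows from the fact that every $X$ in $\mathcal{D}$ is a curb set. We have shown that $\bar{X}$ is a curb set.
The set $\bar{X}$ is an upper bound on $\mathcal{D}$, hence, by Zorn's lemma, it holds that $\mathcal{C}$ has at least one maximal element. A maximal element of $\mathcal{C}$ with respect to $\supseteq$ is a minimal curb set.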
A strict subgame perfect equilibrium is a pure strategy profile $\sigma$ such that $\mu(\{\sigma\}) = \{\sigma\}$, so $\{\sigma\}$ is a singleton curb set. The set-valued generalization of a strict subgame perfect equilibrium is a curb set $X$ such that $\mu(X) = X$.
Definition 5.4 Let D be a decision problem. A curb set X is tight if µ(X) = X.
A tight curb set has the desirable property that none of its elements can be deleted if players hold beliefs in ∆(X). For normal-form games with a finite number of players, Basu and Weibull (1991) show that a minimal curb set is always tight. The next result states that also for decision problems every minimal curb set is tight.
Theorem 5.5 Let D be a decision problem. If X is a minimal curb set of D, then X is tight.
Proof: Let X be a minimal curb set of D. Since µ(X) ⊆ X, it holds that µ(µ(X)) ⊆ µ(X), so µ(X) is a curb set. Since the curb set X is minimal and µ(X) ⊆ X, it follows that µ(X) = X, so the curb set X is tight.
The next result shows that the set of pure subgame optimal policies is a minimal curb set. We have not yet ruled out the possibility that there are other minimal curb sets.
Theorem 5.6 Let D be a decision problem. Then the set of pure subgame optimal policies O is a minimal curb set of D.
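Proof: We first argue that for each $\sigma \in \Delta(O)$ it holds that $b_h(\sigma) = O_h$. To see this, take some $\sigma \in \Delta(O)$ and consider the maximization problem (5.1). Defining $a = a(h)$, we can write
$$U_h(\sigma/a) = \sum_{s \in S_{(h,a)}} \pi(s \mid h, a)\, U_{(h,a,s)}(\sigma) = \sum_{s \in S_{(h,a)}} \pi(s \mid h, a)\, v_{(h,a,s)},$$
where the second equality holds because $\sigma$ is subgame optimal. Hence, by (3.1), the maximum in (5.1) equals $v_h$. It is reached if and only if $a$ is an element of $O_h$.
It follows that $\mu(O) = O$, so that $O$ is a curb set. To see that $O$ is a minimal curb set, let $X$ be a curb set such that $X \subseteq O$. Take any $\sigma \in X$. Since $b_h(\sigma) = O_h$ for every $h \in H$, we have $O \subseteq \mu(X) \subseteq X$, and therefore $X = O$.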
A normal-form game can have several minimal curb sets. The next result shows that the minimal curb set of a decision problem is unique and therefore equal to O.
Theorem 5.7 Let D be a decision problem. Then O is the unique minimal curb set of D.

Proof:
Step 1: Let X be a curb set of D. Then the function U is constant on X.
Let $D' = (S, A, H', \pi', f')$ be the decision problem that is identical to $D$, except that the set of actions at a history $h \in H'$ is restricted to $X_h$, so $H'$ consists of histories $h$ such that $a_k(h) \in X_{h_k}$ for each $k \in \{0, \ldots, \ell(h) - 1\}$. The set $G' \subseteq G$ contains the nature histories corresponding to $H'$, $\pi'$ is the restriction of $\pi$ to $G'$, and $f'$ is the restriction of $f$ to plays of $D'$. Let $v'$ denote the value of $D'$ and $\mu'$ its curb operator. Let $O'$ be the set of pure subgame optimal policies of $D'$. By Theorem 5.6, $O'$ is a minimal curb set of $D'$.
We prove Step 1 by showing that $U(\sigma) = v'$ for each $\sigma \in X$. For $h \in H$, define $X'_h$ to be equal to $O'_h$ if $h \in H'$ and to be equal to $X_h$ if $h \in H \setminus H'$. Let $X' = \prod_{h \in H} X'_h$. We argue that $X'$ is a curb set of $D$. Since $X' \subseteq X$, we have that $\mu(X') \subseteq \mu(X) \subseteq X$. In particular, for $h \in H \setminus H'$ we have $\mu_h(X') \subseteq X_h = X'_h$. Now [...]
Take a history $h \in H$ and consider the decision problem $D_h$. The set $Y = \prod_{h' \in H_h} X_{(h,h')}$ is curb for the decision problem $D_h$. By Step 1, the payoff function $U_h$ is constant on $Y$. The result follows.
Step 3: Let X be a curb set of D. Then it holds that X ⊆ O.
Take any $\sigma \in X$ and suppose that $\sigma \notin O$. By Corollary 4.3, $\sigma$ is not a subgame perfect equilibrium of $D$. Consequently, there exists $h \in H$ such that $\sigma(h) \notin b_h(\sigma)$. Hence, for $a(h) \in b_h(\sigma)$, we have $U_h(\sigma/a(h)) > U_h(\sigma)$. Since $X$ is a curb set, it holds that $b_h(\sigma) \subseteq X_h$. Thus $a(h)$ is an element of $X_h$, and hence $\sigma/a(h)$ is an element of $X$. This is a contradiction to Step 2. The result of Step 3 follows.
Finally, since $X \subseteq O$ is a curb set while $O$ is a minimal curb set, we conclude that $X = O$. Thus $O$ is the only minimal curb set of $D$, as desired.

Conclusions
The standard analysis of decision problems assumes perfect commitment of the decision maker. In this paper, we take the perspective that the decision maker cannot commit to his future action choices. The decision maker is therefore modeled as consisting of multiple selves. The current self of the decision maker has to form beliefs regarding the behavior of his future selves.
We study a class of infinite horizon decision problems that contains the class of Markov decision problems as a special case. We formulate the concepts of optimality at a history and subgame optimality for a policy as a benchmark. These concepts implicitly assume that the decision maker can commit to his future action choices. We argue that these concepts are robust with respect to the multiple selves model under fairly weak assumptions regarding the rationality of future selves. Both the concept of subgame perfect equilibrium and the concept of sets closed under rational behavior yield the set of subgame optimal policies as the unique prediction. Only a concept like Nash equilibrium that makes significantly weaker assumptions with respect to the rationality of future selves leads to a wider class of policies.