Full-information best choice game with hint

In the classical full-information best choice problem a decision maker aims to select the best of a sequence of opportunities, basing his decision only on the exact values of the observations seen so far. In this paper we consider two modifications of this problem. We add a second player who can either offer additional information or block the observed object and demand a tribute. Our goal is to establish the optimal reward for the second player and the best moment to interrupt the decision process. The situation when the number of observations tends to infinity is also studied.


Introduction and literature review
Best choice problems are among the most inspiring problems of modern applied probability. Their origin is the so-called secretary problem; a comprehensive survey of this issue can be found in Ferguson (1989). Gilbert and Mosteller (1966) were the first to consider different variants of best choice problems and solved them by heuristic arguments. They categorized the rank-based problem as the no-information case (which includes the classical secretary problem). On the other hand, in the so-called full-information case we may base the choice of the stopping time on the true values of the objects; this is a much more complex problem. In other words, the no-information problem is a simplified full-information problem: it is always possible to compute the rank of the currently observed object by counting how many of its predecessors were better, but the opposite operation is not possible. In this work we focus on the full-information case. The Markovian approach, which is widely used in this article, was presented by Bojdecki (1978). Exact solutions for the initial problems were given by Samuels (1982). Many modifications of the problem have been studied. Porosiński (1987) presented a model with a random horizon. The link between the infinite problem and the planar Poisson process was established in Gnedin (1996) and, following that, exact results for the initial problem were derived by Gnedin and Miretskiy (2007). The full-information best choice problem in which two choices are allowed was treated in Porosiński (1992). Petruccelli (1982) allowed solicitation of previous observations. A modification in which the decision maker can go back only a fixed number of steps was considered by Tamaki (1986). Recently, Kuchta (2017) derived an optimal strategy for the iterated full-information best choice problem.
The game version of the best choice problem has been treated by many authors, e.g. Porosiński and Szajowski (1996), Sakaguchi and Szajowski (1997) and Sakaguchi (1984). A game with a hint was presented by Dotsenko and Marynych (2014). The authors considered the no-information case, in which the decision maker can observe only the relative ranks of the objects.

Preliminaries: full-information best choice problem
Consider a probability space (Ω, F, P). By E[·] we denote the expectation with respect to the probability measure P. Fix n ∈ N and consider an i.i.d. sequence X_1, ..., X_n from a continuous distribution F(x). Without loss of generality we may assume that it is the uniform distribution U(0, 1), i.e. F(x) = x, x ∈ [0, 1]. Define the filtration F_k = σ(X_1, ..., X_k), k = 1, ..., n.
By T denote the set of all stopping times with respect to the family (F_k)_{k=1,...,n}. The aim is to find a stopping time τ* ∈ T such that E[I_{{X_{τ*} = max_{1≤k≤n} X_k}}] = sup_{τ ∈ T} E[I_{{X_τ = max_{1≤k≤n} X_k}}], where I_A(ω) denotes the indicator of the set A. The moments of consecutive local maxima (cf. Bojdecki 1978) are given by τ_1 = 1, τ_{j+1} = inf{k > τ_j : X_k > X_{τ_j}} (with inf ∅ = ∞). (3) By T_0 let us denote the set of the moments defined in (3). Note that T_0 ⊂ T. Consider the sequence ξ_j = (τ_j, X_{τ_j}), τ_j ∈ T_0 for all j. (4) On the event {τ_j = ∞} we introduce a special absorbing state δ. The sequence in (4) is a homogeneous Markov chain on the state space E = ({1, ..., n} × (0, 1)) ∪ {δ} with σ-algebra E. The one-step transition probabilities P(ξ_{j+1} = b | ξ_j = a) of this chain are, for a = (k, x), P(ξ_{j+1} ∈ {l} × dy | ξ_j = (k, x)) = x^{l-k-1} dy, l > k, y ∈ (x, 1), and P(ξ_{j+1} = δ | ξ_j = (k, x)) = x^{n-k}. (5) It is sufficient to look for the optimal stopping time in the set T_0. Knowing that the kth object is relatively the best and that its value is X_k = x, selecting it yields the gain function g(k, x) = x^{n-k}, (6) i.e. the probability that none of the remaining n − k observations exceeds x; this is provided by the Markov property of the chain (4). Let T be the operator of conditional expectation and let V(·) be the value function of the problem (cf. Shiryayev 1978). From the general theory we know that V satisfies the Bellman equation V(a) = max{g(a), TV(a)}. Let E_a[g(ξ_1)] = Tg(a) be the expected payoff for one step starting from the state a.
Consider the set of states in which the inequality g(a) ≥ Tg(a) holds. Since the problem is monotone, the one-step look-ahead rule is optimal (see Bojdecki 1978) and the optimal stopping region is D = {(k, x) ∈ E : x ≥ d_{n-k}}, (9) where d_{n-k} ∈ (0, 1) is the solution of the equation ∑_{j=1}^{n-k} (1/j)(x^{-j} − 1) = 1. (10) The optimal stopping rule is to stop at the first relative maximum whose state lies in D. The value of the problem can be found in Sakaguchi (1973). Let i = n − k. Then d_i is an increasing sequence: d_0 ≤ d_1 ≤ ··· ≤ d_{n-1}. We show some elementary properties of these thresholds. Recall Bernoulli's inequality.
Theorem 1 For a ≥ −1 and b ≥ 1, (1 + a)^b ≥ 1 + ab. The proof can be found in Bullen (2013) and is omitted here.
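As a quick numerical sanity check (not part of the original argument), Bernoulli's inequality can be verified on a grid of admissible pairs (a, b):

```python
# Verify Bernoulli's inequality (1 + a)**b >= 1 + a*b for a >= -1, b >= 1
# on a grid of values; a small tolerance absorbs floating-point noise.
for ai in range(-10, 21):           # a from -1.0 to 2.0 in steps of 0.1
    a = ai / 10.0
    for bi in range(10, 41):        # b from 1.0 to 4.0 in steps of 0.1
        b = bi / 10.0
        assert (1.0 + a) ** b >= 1.0 + a * b - 1e-12
print("Bernoulli's inequality holds on the grid")
```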
Lemma 1 For any i the following bound holds.
Proof Since the sum in (10) is monotonically decreasing as a function of x and is greater than 1 for all x < d_i, it is sufficient to show the stated inequality, which proves our assertion.
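Since the thresholds d_{n-k} defined by Eq. (10) have no closed form, it may help to see them computed. The following is a minimal numerical sketch (ours, not from the paper; the names `threshold` and `simulate` are illustrative): it solves (10) by bisection and estimates by Monte Carlo the win probability of the rule "stop at the first relative maximum X_k with X_k ≥ d_{n-k}". For moderate n the estimate is close to the known limiting value ≈ 0.580164 of Gilbert and Mosteller (1966).

```python
import random

def threshold(i, tol=1e-12):
    """d_i: the root in (0, 1) of sum_{j=1}^{i} (x**-j - 1)/j = 1 (Eq. (10))."""
    if i == 0:
        return 0.0  # no future observations: accept any relative maximum
    lo, hi = 1e-9, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        s = sum((mid ** (-j) - 1.0) / j for j in range(1, i + 1))
        if s > 1.0:       # the sum decreases in x, so the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(n, trials=100_000, seed=7):
    """Monte Carlo win probability of the threshold rule for horizon n."""
    d = [threshold(i) for i in range(n)]   # d[i] = d_i, i observations left
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        best = max(xs)
        running, chosen = 0.0, None
        for k, x in enumerate(xs, start=1):
            if x > running:                # relative maximum
                running = x
                if x >= d[n - k]:
                    chosen = x
                    break
        if chosen is None:                 # never stopped: take the last object
            chosen = xs[-1]
        wins += (chosen == best)
    return wins / trials

print(threshold(1))   # 0.5, since for i = 1 Eq. (10) reads 1/x - 1 = 1
print(simulate(20))   # close to 0.58
```

Note that d_1 = 0.5 follows directly from (10), which serves as a convenient correctness check of the bisection.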
For an upper bound on d_i we refer to Gilbert and Mosteller (1966). We then get a bound in terms of z, where z is the unique solution in the interval (0, 1) of the equation ∑_{j=1}^{∞} z^j/(j · j!) = 1, (13) with z ≈ 0.804354. Let us examine some properties of the left-hand side of Eq. (10). Let f_i(x) denote the sequence of functions f_i(x) = ∑_{j=1}^{i} (1/j)(x^{-j} − 1), which can be written in the recursive form f_{i+1}(x) = f_i(x) + (x^{-(i+1)} − 1)/(i + 1).
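The constant z ≈ 0.804354 is easy to compute by bisection, since the left-hand side of its defining series equation is increasing in z. Below is our own short sketch, under the assumption that ∑_{j≥1} z^j/(j · j!) = 1 is the equation in question:

```python
def lhs(z, terms=60):
    """Partial sum of sum_{j>=1} z**j / (j * j!), increasing in z on (0, 1)."""
    total, fact = 0.0, 1.0
    for j in range(1, terms + 1):
        fact *= j                    # fact = j!
        total += z ** j / (j * fact)
    return total

def solve_z(tol=1e-12):
    """Bisection for the unique root of lhs(z) = 1 in (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z = solve_z()
print(z)   # approximately 0.804354
```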

The model
Suppose that besides the decision maker (henceforth DM) of the full-information best choice problem there is a second player. He does not make decisions about stopping and choosing the best object; instead he possesses extra information about the best object, i.e. he knows exactly both the position and the value of the maximal element. We will call him a prompter or a prophet (henceforth PR). His aim is to sell this information at a proper moment and to get as much for it as possible. PR must set the price α for the hint before the game begins and he can sell his knowledge only once during the game. The decision maker can accept this proposition, pay the fixed price and learn whether the current object is the best one or not. He can also reject the purchase option and then stop or continue the observations. The above game can be presented as a graph, i.e. as a game in extensive form, at each moment k and for the actual value x of the observed object, i.e. in the state a = (k, x). The payoff function for PR is written at the bottom of the graph.
The goal is to establish the price of the hint α = α(k, x). Consider the Markov chain (ξ_k)_{k=1}^{n} observed by DM as in (4), with state space (E, E), transition probabilities (5) and F_k = σ(ξ_1, ..., ξ_k). Denote by ρ the strategy of PR, i.e. a stopping time with respect to the family F_k. Let τ, τ̄ denote the stopping times of DM and let δ_k be a random variable which takes the value 1 if the proposition of the hint is accepted and 0 otherwise. If the offer is accepted, the history of observations is enriched by the random variable H_k. If the event {ω : ρ(ω) = k} occurs, the strategy of DM becomes two-dimensional, (δ_ρ, τ_ρ). Let us introduce the concept of the hint. In fact, the hint is the indicator of the overall maximal element of the observed sequence, H_k = I{X_k = max_{1≤j≤n} X_j}. Suppose that we are in the state (k, x), i.e. at a moment k we observe a locally maximal object X_k whose value is x, x ∈ (0, 1). There are two possibilities. If x < d_{n-k}, the optimal rule calls for continuing the observations, so the reward function (win probability) is given by (17), where a ∨ b = max{a, b}. Using the hint, the decision maker learns "this is the best object among all" with probability x^{n-k}, or the opposite information with probability 1 − x^{n-k}. In the first case the decision maker stops and chooses the object; otherwise he continues the observations in an optimal manner, and the win probability is given by (18). We define the value of the hint v_1 as the difference between the reward with the hint and the reward without it, i.e. the difference between (18) and (17). In the case X_k = x with x ≥ d_{n-k}, the optimal rule calls for stopping immediately, and the win probability is given by (20). If the decision maker decides to use the hint, the payoff is (21), which gives the value of the hint v_2 as the difference between (21) and (20). Fact 1 Let x ≤ d_{n-k} for k ∈ {1, ..., n}. The function v_1(k, x) is an increasing function of x.
Proof Since Tv(k, x) is decreasing as x approaches the threshold d_{n-k}, the whole function, as a product of increasing functions, is increasing.
For the function v_2(k, x) we have the following. Fact 2 Let x ≥ d_{n-k}. The function v_2(k, x) is a decreasing function of x.
Proof Let us calculate the derivative of v_2 with respect to x, with i = n − k. The derivative is negative if the corresponding inequality holds for all x in the domain of v_2. From the description of the problem we know that ∑_{m=1}^{i} (x^{-m} − 1)/m ≤ 1 and x^i ≤ 1, so we conclude that the derivative is negative and v_2(k, x) is a decreasing function of x.
For a fixed index k the maximal value of v_1 is attained at x = d_{n-k} and, since the sum in the above formulas equals 1 (see (10)), we get a_{n-k} = d_{n-k}^{n-k}(1 − d_{n-k}^{n-k}). Lemma 2 Let i = n − k. The sequence a_i is decreasing in i and converges to e^{-z}(1 − e^{-z}), where z is given by (13).
Proof The sequence c_i = d_i^i is decreasing and converges to e^{-z} (cf. Sakaguchi 1973); it is also bounded, e^{-z} ≤ c_i ≤ 0.5. Consider the function f(x) = x(1 − x); for x < 0.5 it is increasing. Therefore the product a_i = d_i^i(1 − d_i^i) is decreasing. It is also bounded and converges to the product of the limits of the sequences c_i and 1 − c_i, i.e. to e^{-z}(1 − e^{-z}) ≈ 0.2472. Note that this value is greater than its counterpart in the no-information case, where it equals e^{-1}(1 − e^{-1}) ≈ 0.232544. Figure 1 shows the first 20 values of a_i. Let us recall the principle of optimality: an optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision (cf. Bellman 1957). To find the exact form of the value function v_1(k, x) we can treat the remaining observations as a best choice problem with a random horizon. To be more specific, consider the following lemma. Lemma 3 Suppose that in the full-information best choice problem with finite horizon n the current state of the process is (k, x) for some k ∈ {1, 2, ..., n} and x ∈ (0, 1), and that the process has not been stopped yet. Then the optimal strategy is to stop at the first (if any) state (k + m, u) such that 0 ≤ m ≤ n − k and u ≥ u_{n-k-m}(x), where u_{n-k-m}(x) is the solution of Eq. (25). The win probability of using the optimal strategy is then obtained explicitly. Proof Since the current object, whose value is x, is a relative maximum, we may restrict the further chain to those observations that are greater than x. The probability that an observation at any of the moments k + 1, k + 2, ..., n exceeds x is 1 − x. Therefore, from now on we consider a full-information best choice problem with a random horizon, with observations from the uniform distribution on the interval (x, 1). The horizon M is binomially distributed, i.e. P(M = m) = C(n − k, m)(1 − x)^m x^{n-k-m}, m = 0, ..., n − k.
Consider the sequence (d(m, u))_{m=0}^{n-k}. From Porosiński (1987) we know that if this sequence changes sign K times, then the stopping region has no more than K stopping islands. Here, however, {d(m, u)}_{m=0}^{n-k} changes sign at most once, and when k is close to n its value decreases to 0. So the truncated problem is monotone and the optimal strategy is a threshold strategy. The thresholds u_{n-k-m}(x) can be calculated directly from (25).

Remark 1
In the classical version of the best choice problem with a random horizon studied by Porosiński, the payoff function forces the decision maker to make at least one step; it is not possible to stop at the very beginning or, in the language of Markov chains, at the stage (0, 0). Here, however, in the truncated problem such a possibility exists, because the payoff at the "initial" stage can be bigger than the expected one. The "initial" value x of the current object must then exceed a threshold value, namely u_{n-k}(x), which can be calculated from (25) with m = 0. Since in this special state u = x, we see that u_{n-k}(x) is the unique solution of the resulting equation. The same equation arises in the case of (10) (this equivalence was shown in Samuels 1982), so u_{n-k}(x) = d_{n-k}.
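The monotone behavior of the sequence a_i from Lemma 2 is easy to confirm numerically. Below is our own sketch (not from the paper); the helper `threshold` solves Eq. (10) by bisection:

```python
def threshold(i, tol=1e-12):
    """d_i: the root in (0, 1) of sum_{j=1}^{i} (x**-j - 1)/j = 1 (Eq. (10))."""
    lo, hi = 1e-9, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        s = sum((mid ** (-j) - 1.0) / j for j in range(1, i + 1))
        if s > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = []
for i in range(1, 21):
    c = threshold(i) ** i        # c_i = d_i**i, decreasing towards e**-z
    a.append(c * (1.0 - c))      # a_i = c_i * (1 - c_i)

print(a[0])    # 0.25, since d_1 = 0.5
print(a[-1])   # close to the limit e**-z * (1 - e**-z), about 0.2472
```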
The prompter has two strategies to choose from: to sell the information or not. Then the DM has to choose either to buy the hint or not. However, if the price of the hint is less than the maximal value of the hint, the decision maker will buy it without a doubt. So PR has to decide before the game what the price of the hint will be. He has the following possibilities:

1. set a constant price α for the whole game;
2. set a vector of prices depending on the moment of the game: α = (α_1, ..., α_n);
3. set a price function depending on the value of the current observation: α = α(x);
4. set a vector of prices depending on both the moment of the game and the current value of the observed object: α = (α_1(x), ..., α_n(x)).

α = const
Consider the numbers a_i = d_i^i(1 − d_i^i) from Lemma 2. There are three possibilities for the price:

- α ≥ 0.25: the hint is not worth buying; the price is higher than its value at any moment.
- e^{-z}(1 − e^{-z}) ≤ α < 0.25: the hint is worth buying only from some moment k* on.
- α < e^{-z}(1 − e^{-z}): the hint is worth buying for every k = 1, 2, ..., n, no matter how large n is; using the previous symbol, in this case k* = 1.

Suppose that α < e^{-z}(1 − e^{-z}). The hint will be sold if the current state of the process falls into the corresponding set of states. Suppose that at the moment k we observe a relatively maximal element whose value is x and it is worth buying the hint. The probability of that event can be written with the help of a ∧ b = min{a, b}. The average payoff for the hint is given by (32). The optimal price is the minimal number α that maximizes (32).
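The three price regimes can be explored numerically. In the sketch below (ours; `hint_value_bound` and `k_star` are illustrative names, not from the paper) the hint is treated as worth buying at moment k whenever α is below a_{n-k}, and k* is the first such moment:

```python
def threshold(i, tol=1e-12):
    """d_i: the root in (0, 1) of sum_{j=1}^{i} (x**-j - 1)/j = 1 (Eq. (10))."""
    lo, hi = 1e-9, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        s = sum((mid ** (-j) - 1.0) / j for j in range(1, i + 1))
        if s > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def hint_value_bound(i):
    """a_i = d_i**i * (1 - d_i**i): maximal value of the hint, i steps left."""
    if i == 0:
        return 0.0        # at the last step the hint is worthless
    c = threshold(i) ** i
    return c * (1.0 - c)

def k_star(alpha, n):
    """First moment k at which the hint is worth buying at price alpha, else None."""
    for k in range(1, n):
        if hint_value_bound(n - k) > alpha:
            return k
    return None

print(k_star(0.20, 30))   # 1: price below the limit e**-z * (1 - e**-z)
print(k_star(0.30, 30))   # None: price above 0.25, never worth buying
```

Since a_i decreases in i = n − k, an intermediate price is exceeded only when few observations remain, which is why k* moves towards n as α grows.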

The model with a tribute
In this case the prompter, who knows the exact value of the hint, does not want to sell his knowledge as in Sect. 3. Instead, once during the whole game he can block the current element and demand a payment from the second player for unlocking the hidden element. The decision maker then has two strategies: to pay the amount and stop at the unlocked element, or not to pay and continue the observations. The graph below presents the possible strategies of both players.

[Game tree: PR chooses between NOT BLOCK and BLOCK]
Suppose that we are in the state (k, x), i.e. at a moment k we observe an object X_k whose value is x, x ∈ (0, 1). There are two possibilities: x < d_{n-k} and x ≥ d_{n-k}.
Since the DM will not choose the object if x < d_{n-k}, let us consider the case x ≥ d_{n-k}. The PR can hide the object and demand a fixed price α; his payoff is therefore α. The DM has two possibilities. The first is to pay the tribute and stop at the object, in which case his payoff is ϕ_{1,α}(k, x). Otherwise he continues the observations and earns ϕ_2(k, x). The DM will pay the tribute if the inequality ϕ_{1,α}(k, x) ≥ ϕ_2(k, x) holds. Note that the function on the right-hand side of the resulting inequality is increasing for d_{n-k} ≤ x ≤ 1. The set of states in which it is worth paying is T(α) = {(k, x) : x ≥ t_{n-k}(α)}, where t_{n-k}(α), α ∈ [0, 1], is the solution of the corresponding equation and D is defined in (9). Equality holds for α = 0, which implies that t_{n-k}(α) ≥ d_{n-k}. Let us assume that the DM does not know that the PR exists until he starts acting. He will pay the money if the observed chain of maximal elements falls into the set T(α) without falling into the stopping set D earlier. Denote by p(α) the probability of that event. The PR's expected payoff is g(α) = α p(α).
Theorem 2 In the full-information best choice problem with a tribute the optimal strategy ρ* for the prompter exists and is determined by the price α* = inf{α > 0 : g(α) = sup_a g(a)}.

The limiting values
Let us analyze the properties of the payoff function of the PR as the number of observations tends to infinity. Suppose that i → ∞. Then the price of the hint should satisfy the corresponding limiting inequality, and the thresholds t_i(α) converge to a limit t_α, the unique solution of the limiting equation. The graph below presents the values of t_α as a function of the parameter α (Fig. 2).

Conclusion
In the world around us, despite widespread access to information, there are still cases where certain information is obscured and inaccessible, and access to it can be extremely valuable. Acquiring it is not always possible, but a special occasion to buy it may arise; in such cases the profitability of the purchase should be seriously considered. These considerations motivated the above models. The aim of this work was to construct a mathematical model describing the mechanism of obtaining additional information in various market situations. Usually such information is secret and difficult to obtain; hence in the model there is a single prompter who has exclusive information. The model can be extended. One possibility is to introduce more than one decision maker into the game; the prompter then sells to the player who offers the most. Another possibility is the appearance of more sellers of the information; in either case the players' knowledge about each other must be taken into account. In the first model we have found a formula for the optimal price of the hint and shown that the value of the hint has a limit. In the second model the prompter behaves more like an extortionist and blocks the ability to stop; here too the game can be extended with additional players. For this game we have found an equilibrium price and the optimal strategy for the prompter, and we have derived the limit of the tribute as the number of observations tends to infinity.