Learning in a Game of Strategic Experimentation with Three-Armed Exponential Bandits

Klein, Nicolas

doi:10.1007/978-3-319-92988-0_4


Part of the book series: Static & Dynamic Game Theory: Foundations & Applications (SDGTFA)


Abstract

The present article provides some additional results for the two-player game of strategic experimentation with three-armed exponential bandits analyzed in Klein (Games Econ Behav 82:636–657, 2013). Players play replica bandits, with one safe arm and two risky arms, which are known to be of opposite types. It is initially unknown, however, which risky arm is good and which is bad. A good risky arm yields lump sums at exponentially distributed times when pulled. A bad risky arm never yields any payoff. In this article, I give a necessary and sufficient condition for the state of the world eventually to be found out with probability 1 in any Markov perfect equilibrium in which at least one player’s value function is continuously differentiable. Furthermore, I provide closed-form expressions for the players’ value function in a symmetric Markov perfect equilibrium for low and intermediate stakes.
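The learning mechanism described in the abstract can be illustrated with a minimal sketch of Bayesian updating while no lump sum arrives. It assumes, purely for illustration, that a good risky arm pays out at a common Poisson rate `rate` on both risky arms and that the belief `p` is the probability that risky arm A is the good one; the function name and all parameter names are hypothetical and not taken from the chapter.

```python
import math


def posterior_after_no_news(p, rate, t_on_a, t_on_b):
    """Posterior probability that risky arm A is good after no lump sum.

    Assumptions (illustrative, not from the chapter): exactly one of the
    two risky arms is good, a good arm pays at exponentially distributed
    times with parameter `rate`, and a bad arm never pays.
    `t_on_a` and `t_on_b` are the total times arms A and B were pulled.
    """
    # Probability of observing no lump sum if A is the good arm:
    # only the time spent on A could have produced one.
    no_news_if_a_good = math.exp(-rate * t_on_a)
    # Probability of observing no lump sum if B is the good arm.
    no_news_if_b_good = math.exp(-rate * t_on_b)
    numerator = p * no_news_if_a_good
    return numerator / (numerator + (1.0 - p) * no_news_if_b_good)


# Pulling only arm A without success makes A look worse.
print(posterior_after_no_news(0.5, 1.0, 1.0, 0.0))
# Pulling only arm B without success makes A look better.
print(posterior_after_no_news(0.5, 1.0, 0.0, 1.0))
```

Under these assumptions, equal unsuccessful time on both risky arms leaves the belief unchanged, whereas a single lump sum on either arm reveals the state of the world completely.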


Notes

1. The multi-armed bandit model was first introduced by Thompson [10] and Robbins [9], and subsequently analyzed, amongst others, by Bradt et al. [4] and Bellman [1]. Gittins and Jones [5] provided the famous Gittins-index characterization of an optimal policy.
2. The utilitarian planner maximizes the sum of the players’ utilities. The solution to this problem is the policy the players would want to commit to at the outset of the game if they had commitment power. It thus constitutes a natural efficient benchmark against which to compare our equilibria.
3. By contrast, Bolton and Harris [2] identified an encouragement effect in their model. It makes players experiment at beliefs that are more pessimistic than their single-agent cutoffs. This is because they will receive good news with some probability, which will make the other players more optimistic also. This then induces them to provide more experimentation, from which the first player then benefits in turn. With fully revealing breakthroughs as in [6, 8], or this model, however, a player could not care less what others might do after a breakthrough, as there will not be anything left to learn. Therefore, there is no encouragement effect in these models.
4. The efficient solution in [6] also implies incomplete learning.
5. For perfect negative correlation, this is true in any equilibrium; for general negative correlation, there always exists an equilibrium with this property.
6. The technical requirement that at least one player’s value function be continuously differentiable is needed on account of complications pertaining to the admissibility of strategies. I use it in the proof of Lemma 4.1 to establish that the safe payoff s constitutes a lower bound on the player’s equilibrium value. However, by insisting, for example, on playing (1, 0) at a single belief \(\hat{p}\) while playing (0, 0) everywhere else in a neighborhood of \(\hat{p}\), a player could force the other player to play (0, 1) at \(\hat{p}\) for mere admissibility reasons. Thus, both players’ equilibrium value functions might be pushed below s at certain beliefs \(\hat{p}\). For the purposes of this section, I rule out such implausible behavior by restricting attention to equilibria in which at least one player’s value function is smooth.
7. See Proposition 3.1 in [6].
8. See Proposition 8 in [8].
9. Strictly speaking, the first inequality relies on the admissibility of the action (0, 0) at \(\tilde{p}\). However, even if (0, 0) should not be admissible at \(\tilde{p}\), my definition of strategies still guarantees the existence of a neighborhood of \(\tilde{p}\) in which (0, 0) is admissible everywhere except at \(\tilde{p}\). Hence, by continuous differentiability of u, there exists a belief \(\tilde{\tilde{p}}\) in this neighborhood at which the same contradiction can be derived.
10. Again, strictly speaking, the first inequality relies on the admissibility of the action (1, 0) at the belief in question, and my previous remark applies.

References

1. Bellman, R.: A problem in the sequential design of experiments. Sankhya Indian J. Stat. (1933–1960) 16(3/4), 221–229 (1956)
2. Bolton, P., Harris, C.: Strategic experimentation. Econometrica 67, 349–374 (1999)
3. Bolton, P., Harris, C.: Strategic experimentation: the undiscounted case. In: Hammond, P.J., Myles, G.D. (eds.) Incentives, Organizations and Public Economics – Papers in Honour of Sir James Mirrlees, pp. 53–68. Oxford University Press, Oxford (2000)
4. Bradt, R., Johnson, S., Karlin, S.: On sequential designs for maximizing the sum of n observations. Ann. Math. Stat. 27, 1060–1074 (1956)
5. Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Progress in Statistics, European Meeting of Statisticians, 1972, vol. 1, pp. 241–266. North-Holland, Amsterdam (1974)
6. Keller, G., Rady, S., Cripps, M.: Strategic experimentation with exponential bandits. Econometrica 73, 39–68 (2005)
7. Klein, N.: Strategic learning in teams. Games Econ. Behav. 82, 636–657 (2013)
8. Klein, N., Rady, S.: Negatively correlated bandits. Rev. Econ. Stud. 78, 693–732 (2011)
9. Robbins, H.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952)
10. Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)


Author information

Authors and Affiliations

Université de Montréal and CIREQ, Département de Sciences Économiques, Montréal, QC, Canada
Nicolas Klein


Editor information

Editors and Affiliations

St. Petersburg State University, St. Petersburg, Russia
Leon A. Petrosyan
Institute of Applied Mathematical Research, Karelian Research Center of RAS, Petrozavodsk, Russia
Vladimir V. Mazalov
Graduate School of Management, St. Petersburg State University, St. Petersburg, Russia
Nikolay A. Zenkevich


About this chapter

Cite this chapter

Klein, N. (2018). Learning in a Game of Strategic Experimentation with Three-Armed Exponential Bandits. In: Petrosyan, L., Mazalov, V., Zenkevich, N. (eds) Frontiers of Dynamic Games. Static & Dynamic Game Theory: Foundations & Applications. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-92988-0_4


DOI: https://doi.org/10.1007/978-3-319-92988-0_4
Published: 18 July 2018
Publisher Name: Birkhäuser, Cham
Print ISBN: 978-3-319-92987-3
Online ISBN: 978-3-319-92988-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
