Learning in a Game of Strategic Experimentation with Three-Armed Exponential Bandits

Chapter in Frontiers of Dynamic Games

Part of the book series: Static & Dynamic Game Theory: Foundations & Applications ((SDGTFA))

Abstract

The present article provides some additional results for the two-player game of strategic experimentation with three-armed exponential bandits analyzed in Klein (Games Econ Behav 82:636–657, 2013). Players play replica bandits, with one safe arm and two risky arms, which are known to be of opposite types. It is initially unknown, however, which risky arm is good and which is bad. A good risky arm yields lump sums at exponentially distributed times when pulled. A bad risky arm never yields any payoff. In this article, I give a necessary and sufficient condition for the state of the world eventually to be found out with probability 1 in any Markov perfect equilibrium in which at least one player’s value function is continuously differentiable. Furthermore, I provide closed-form expressions for the players’ value function in a symmetric Markov perfect equilibrium for low and intermediate stakes.
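To make the informational structure concrete, the belief dynamics described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the chapter: the breakthrough rate `LAM`, period length `DT`, prior `P0`, and the myopic pulling rule are all assumed for the example. A breakthrough fully reveals which risky arm is good (the arms are known to be of opposite types); absent a breakthrough, Bayes' rule shifts the belief against the arm being pulled.

```python
import math
import random

# Illustrative parameters (assumptions, not values from the chapter).
LAM = 1.0   # rate at which a good risky arm yields lump sums
DT = 0.01   # length of one discrete simulation period
P0 = 0.5    # prior probability that risky arm 1 is the good one

def update_belief(p, pulled_arm1, breakthrough):
    """Bayes update of the belief p that risky arm 1 is good.

    The two risky arms are of opposite types, so a breakthrough on
    either arm fully reveals the state; absent a breakthrough, the
    belief drifts against the arm that was pulled.
    """
    if breakthrough:
        return 1.0 if pulled_arm1 else 0.0
    no_hit = math.exp(-LAM * DT)  # P(no lump sum in DT | pulled arm is good)
    if pulled_arm1:
        return p * no_hit / (p * no_hit + (1.0 - p))
    return p / (p + (1.0 - p) * no_hit)

def simulate(horizon=10.0, seed=0):
    """Pull the risky arm currently believed more likely to be good,
    until the state is revealed or the horizon is reached."""
    rng = random.Random(seed)
    arm1_good = rng.random() < P0      # draw the true state once
    p, t = P0, 0.0
    while t < horizon:
        pull1 = p >= 0.5
        pulled_is_good = arm1_good if pull1 else not arm1_good
        hit = pulled_is_good and rng.random() < 1.0 - math.exp(-LAM * DT)
        p = update_belief(p, pull1, hit)
        if p in (0.0, 1.0):            # breakthrough: state fully learned
            return p, t
        t += DT
    return p, t                        # horizon reached without revelation
```

Note how learning is one-sided in this model: pulling either risky arm without success makes the *other* risky arm look better, and a single breakthrough ends all learning, which is why (as discussed in the notes below) there is no encouragement effect here.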


Notes

  1.

    The multi-armed bandit model was first introduced by Thompson [10] and Robbins [9], and subsequently analyzed, amongst others, by Bradt et al. [4] and Bellman [1]. Gittins and Jones [5] provided the famous Gittins-index characterization of an optimal policy.

  2.

    The utilitarian planner maximizes the sum of the players’ utilities. The solution to this problem is the policy the players would want to commit to at the outset of the game if they had commitment power. It thus constitutes a natural efficient benchmark against which to compare our equilibria.

  3.

    By contrast, Bolton and Harris [2] identified an encouragement effect in their model: it makes players experiment at beliefs more pessimistic than their single-agent cutoffs. The reason is that a player will receive good news with some probability, which makes the other players more optimistic as well; this in turn induces them to provide more experimentation, from which the first player benefits. With fully revealing breakthroughs, as in [6, 8] or this model, however, a player could not care less what the others do after a breakthrough, as nothing is left to learn. Therefore, there is no encouragement effect in these models.

  4.

    The efficient solution in [6] also implies incomplete learning.

  5.

    For perfect negative correlation, this is true in any equilibrium; for general negative correlation, there always exists an equilibrium with this property.

  6.

    The technical requirement that at least one player’s value function be continuously differentiable is needed on account of complications pertaining to the admissibility of strategies. I use it in the proof of Lemma 4.1 to establish that the safe payoff s constitutes a lower bound on the player’s equilibrium value. However, by insisting on playing (1, 0) at a single belief \(\hat {p}\) while playing (0, 0) everywhere else in a neighborhood of \(\hat {p}\), a player could, for instance, force the other player to play (0, 1) at \(\hat {p}\) for mere admissibility reasons. Thus, both players’ equilibrium value functions might be pushed below s at certain beliefs \(\hat {p}\). For the purposes of this section, I rule out such implausible behavior by restricting attention to equilibria in which at least one player’s value function is smooth.

  7.

    See Proposition 3.1 in [6].

  8.

    See Proposition 8 in [8].

  9.

    Strictly speaking, the first inequality relies on the admissibility of the action (0, 0) at \(\tilde {p}\). However, even if (0, 0) should not be admissible at \(\tilde {p}\), my definition of strategies still guarantees the existence of a neighborhood of \(\tilde {p}\) in which (0, 0) is admissible everywhere except at \(\tilde {p}\). Hence, by continuous differentiability of u, there exists a belief \(\tilde {\tilde {p}}\) in this neighborhood at which the same contradiction can be derived.

  10.

    Again, strictly speaking, the first inequality relies on the admissibility of the action (1, 0) at the belief in question, and my previous remark applies.

References

  1. Bellman, R.: A problem in the sequential design of experiments. Sankhya Indian J. Stat. (1933–1960) 16(3/4), 221–229 (1956)


  2. Bolton, P., Harris, C.: Strategic experimentation. Econometrica 67, 349–374 (1999)


  3. Bolton, P., Harris, C.: Strategic experimentation: the Undiscounted case. In: Hammond, P.J., Myles, G.D. (eds.) Incentives, Organizations and Public Economics – Papers in Honour of Sir James Mirrlees, pp. 53–68. Oxford University Press, Oxford (2000)


  4. Bradt, R., Johnson, S., Karlin, S.: On sequential designs for maximizing the sum of n observations. Ann. Math. Stat. 27, 1060–1074 (1956)


  5. Gittins, J., Jones, D.: A dynamic allocation index for the sequential design of experiments. In: Progress in Statistics, European Meeting of Statisticians, 1972, vol. 1, pp. 241–266. North-Holland, Amsterdam (1974)


  6. Keller, G., Rady, S., Cripps, M.: Strategic experimentation with exponential bandits. Econometrica 73, 39–68 (2005)


  7. Klein, N.: Strategic learning in teams. Games Econ. Behav. 82, 636–657 (2013)


  8. Klein, N., Rady, S.: Negatively correlated bandits. Rev. Econ. Stud. 78, 693–732 (2011)


  9. Robbins, H.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952)


  10. Thompson, W.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294 (1933)



Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

Cite this chapter

Klein, N. (2018). Learning in a Game of Strategic Experimentation with Three-Armed Exponential Bandits. In: Petrosyan, L., Mazalov, V., Zenkevich, N. (eds) Frontiers of Dynamic Games. Static & Dynamic Game Theory: Foundations & Applications. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-92988-0_4