Online Learning with Variable Stage Duration

  • Shie Mannor
  • Nahum Shimkin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4005)

Abstract

We consider online learning in repeated decision problems, within the framework of a repeated game against an arbitrary opponent. For repeated matrix games, well-known results establish the existence of no-regret strategies; such strategies secure a long-term average payoff that comes close to the maximal payoff that could be obtained, in hindsight, by playing any fixed action against the observed actions of the opponent. In the present paper we consider the extended model where the duration of each stage of the game may depend on the actions of both players, while the performance measure of interest is the average payoff per unit time. We start the analysis of online learning in repeated games with variable stage duration by showing that no-regret strategies, in the above sense, do not exist in general. Consequently, we consider two classes of adaptive strategies, one based on Blackwell’s approachability theorem and the other on calibrated forecasts, and examine their performance guarantees. In either case we show that the long-term average payoff is higher than a certain function of the empirical distribution of the opponent’s actions, and in particular is strictly higher than the minimax value of the repeated game whenever that empirical distribution deviates from a minimax strategy in the stage game.
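The performance measure described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function names, the payoff matrix `rewards`, and the duration matrix `durations` are all hypothetical, and the hindsight benchmark is assumed here to be the payoff rate a single fixed action would have earned against the observed opponent actions (total payoff divided by total elapsed time).

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def average_payoff_per_unit_time(
    rewards: Matrix,
    durations: Matrix,
    my_actions: Sequence[int],
    opp_actions: Sequence[int],
) -> float:
    """Total payoff divided by total elapsed time over the played stages."""
    pairs = list(zip(my_actions, opp_actions))
    total_reward = sum(rewards[a][b] for a, b in pairs)
    total_time = sum(durations[a][b] for a, b in pairs)
    return total_reward / total_time


def best_fixed_action_rate(
    rewards: Matrix,
    durations: Matrix,
    opp_actions: Sequence[int],
) -> float:
    """Best payoff rate, in hindsight, of any single fixed action
    against the observed opponent actions (assumed benchmark)."""
    rates = []
    for a in range(len(rewards)):
        total_reward = sum(rewards[a][b] for b in opp_actions)
        total_time = sum(durations[a][b] for b in opp_actions)
        rates.append(total_reward / total_time)
    return max(rates)


if __name__ == "__main__":
    # Hypothetical 2x2 game: rows are our actions, columns the opponent's.
    rewards = [[1.0, 0.0], [0.0, 1.0]]
    durations = [[1.0, 2.0], [2.0, 1.0]]  # stage length depends on both actions
    my_actions = [1, 1, 0, 0]
    opp_actions = [0, 0, 0, 1]

    achieved = average_payoff_per_unit_time(rewards, durations, my_actions, opp_actions)
    benchmark = best_fixed_action_rate(rewards, durations, opp_actions)
    print("achieved rate:", achieved)
    print("best fixed-action rate:", benchmark)
    print("regret (rate difference):", benchmark - achieved)
```

Because each action changes the denominator as well as the numerator, the benchmark is a ratio of sums rather than an average of per-stage payoffs. This is the feature that separates the variable-duration model from the standard repeated matrix game.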

Keywords

Online Learn; Repeated Game; Performance Guarantee; Matrix Game; Stage Game
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shie Mannor (1)
  • Nahum Shimkin (2)
  1. Department of Electrical and Computer Engineering, McGill University, Québec
  2. Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa, Israel
