Online Learning with Constraints

  • Shie Mannor
  • John N. Tsitsiklis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4005)


We study online learning where the objective of the decision maker is to maximize her long-term average reward subject to some average constraints that must be satisfied along the sample path. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature's choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. We further provide an explicit strategy that attains this convex hull using a calibrated forecasting rule.
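To make the abstract's central object concrete, here is one standard way to formalize the reward-in-hindsight; the notation (reward matrix R, constraint matrix C, constraint level c_0) is ours and not necessarily the paper's. If the decision maker mixes over a finite action set A, Nature plays from a finite set B, and q ∈ Δ(B) denotes the empirical distribution of Nature's actions over the horizon, then a natural definition is

\[
  r^*(q) \;=\; \max_{p \in \Delta(A)} \left\{\, p^{\top} R\, q \;:\; p^{\top} C\, q \le c_0 \,\right\}.
\]

Under this reading, the negative result says that in general no online strategy can guarantee an average reward converging to r^*(q) while keeping the sample-path average constraint below c_0; the attainable benchmark is instead the convex hull of r^*.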


Keywords: Convex hull · Online learning · Markov decision process · Stochastic game · Average reward





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shie Mannor (1)
  • John N. Tsitsiklis (2)
  1. Department of Electrical and Computer Engineering, McGill University, Québec
  2. Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge
