Online Learning with Constraints
We study online learning where the objective of the decision maker is to maximize her average long-term reward given that some average constraints are satisfied along the sample path. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature’s choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint the convex hull turns out to be the highest attainable function. We further provide an explicit strategy that attains this convex hull using a calibrated forecasting rule.
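To make the reward-in-hindsight concrete, the following is a minimal sketch for the single-constraint case. It assumes, beyond what the abstract states, that the decision maker is compared against the best stationary mixed action, that Nature's choices enter only through their empirical frequencies q, and that rewards and costs are given as finite matrices. The names `reward_in_hindsight`, `R`, `C`, `q`, and `c0` are hypothetical and chosen for illustration only. Under these assumptions the problem is a small linear program whose optimum lies either at a feasible pure action or at a mixture of two actions that meets the constraint with equality, so the sketch simply enumerates those candidates.

```python
from itertools import combinations

def reward_in_hindsight(R, C, q, c0):
    """Best average reward of a stationary mixed action against the
    empirical distribution q, subject to average cost <= c0.

    R[a][b], C[a][b]: reward and cost when the decision maker plays a
    and Nature plays b (illustrative assumption, not from the source).
    Returns None if no mixed action satisfies the constraint.
    """
    n = len(R)
    # Expected reward and cost of each pure action against q.
    r = [sum(qb * R[a][b] for b, qb in enumerate(q)) for a in range(n)]
    c = [sum(qb * C[a][b] for b, qb in enumerate(q)) for a in range(n)]

    best = None
    # Candidate 1: pure actions that already satisfy the constraint.
    for a in range(n):
        if c[a] <= c0:
            best = r[a] if best is None else max(best, r[a])
    # Candidate 2: two-action mixtures whose average cost equals c0.
    for a, b in combinations(range(n), 2):
        if (c[a] - c0) * (c[b] - c0) < 0:
            w = (c0 - c[b]) / (c[a] - c[b])  # weight placed on action a
            value = w * r[a] + (1 - w) * r[b]
            best = value if best is None else max(best, value)
    return best

# Action 0 earns reward 1 at cost 1; action 1 earns 0 at cost 0.
R = [[1.0, 1.0], [0.0, 0.0]]
C = [[1.0, 1.0], [0.0, 0.0]]
value = reward_in_hindsight(R, C, q=[0.5, 0.5], c0=0.5)
```

In this toy instance the constraint allows the costly action to be played half the time, so `value` is 0.5. The abstract's negative result concerns this quantity viewed as a function of q: in general no online strategy can guarantee it, whereas its convex hull can be attained.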
Keywords: Convex Hull, Online Learning, Markov Decision Process, Stochastic Game, Average Reward