
Following the Perturbed Leader to Gamble at Multi-armed Bandits

  • Jussi Kujala
  • Tapio Elomaa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4754)

Abstract

Following the perturbed leader (FPL) is a powerful technique for solving online decision problems; Kalai and Vempala [1] recently rediscovered this algorithm, which is originally due to Hannan [2]. A traditional model for online decision problems is the multi-armed bandit, in which a gambler must choose at each round one of k levers to pull, with the aim of minimizing the cumulative cost. There are four versions of the nonstochastic optimization setting, of which the most demanding is a game played against an adaptive adversary in the bandit setting. An adaptive adversary may alter its strategy of assigning costs to decisions depending on the decisions the gambler has chosen in the past. In the bandit setting the gambler only learns the cost of the choice he made, rather than the costs of all available alternatives. In this work we show that the very straightforward and easy-to-implement algorithm Adaptive Bandit FPL can attain a regret of \(O(\sqrt{T \ln T})\) against an adaptive adversary. This regret holds with respect to the best lever in hindsight and matches the best previously known bounds of \(O(\sqrt{T \ln T})\).
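As a concrete illustration of the setting, the following minimal Python sketch shows one way a bandit FPL loop can be organized, mixing uniform exploration with perturbed-leader exploitation. It is a generic sketch, not the paper's Adaptive Bandit FPL: the callback `pull`, the exponential perturbations, and the default choices of `epsilon` and `gamma` are assumptions made for this example.

```python
import math
import random

def bandit_fpl(pull, k, T, epsilon=None, gamma=None):
    """Run T rounds of a bandit-style follow-the-perturbed-leader loop.

    pull(arm) must return the cost in [0, 1] of the pulled lever; only this
    single cost is revealed, as in the bandit setting. The parameter
    defaults below are illustrative, not the constants of the paper.
    """
    if epsilon is None:
        epsilon = math.sqrt(math.log(max(k, 2)) / T)               # perturbation rate
    if gamma is None:
        gamma = min(1.0, math.sqrt(k * math.log(max(k, 2)) / T))   # exploration rate

    est_cost = [0.0] * k   # importance-weighted cumulative cost estimates
    total_cost = 0.0

    for _ in range(T):
        if random.random() < gamma:
            # Exploration round: pull a uniformly random lever. Each lever is
            # observed with probability gamma / k, so reweighting the observed
            # cost by k / gamma keeps the cumulative estimates unbiased.
            arm = random.randrange(k)
            cost = pull(arm)
            est_cost[arm] += cost * k / gamma
        else:
            # Exploitation round: follow the leader on estimated costs,
            # perturbed by fresh exponential noise with rate epsilon.
            perturbed = [est_cost[i] - random.expovariate(epsilon)
                         for i in range(k)]
            arm = min(range(k), key=lambda i: perturbed[i])
            cost = pull(arm)   # observed cost is not fed into the estimates here
        total_cost += cost

    return total_cost
```

For example, `bandit_fpl(lambda arm: random.random() * (arm + 1) / 5, k=5, T=10000)` runs the loop against a simple stochastic cost source; an adaptive adversary would instead pick each round's cost vector after seeing the gambler's past pulls, which is the harder case the paper's regret bound addresses.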

Keywords

Slot Machine · Cost Vector · Adaptive Adversary · Good Lever · Oblivious Adversary


References

  1. Kalai, A.T., Vempala, S.: Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71(3), 26–40 (2005)
  2. Hannan, J.: Approximation to Bayes risk in repeated plays. In: Dresher, M., Tucker, A., Wolfe, P. (eds.) Contributions to the Theory of Games, vol. 3, pp. 97–139. Princeton University Press, Princeton (1957)
  3. Kujala, J., Elomaa, T.: On following the perturbed leader in the bandit setting. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 371–385. Springer, Heidelberg (2005)
  4. Dani, V., Hayes, T.P.: How to beat the adaptive multi-armed bandit. Technical report, Cornell University (2006), http://arxiv.org/cs.DS/602053
  5. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The non-stochastic multi-armed bandit problem. SIAM Journal on Computing 32(1), 48–77 (2002)
  6. Auer, P.: Using upper confidence bounds for online learning. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 270–279. IEEE Computer Society Press, Los Alamitos (2000)
  7. Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58, 527–535 (1952)
  8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
  9. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)
  10. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information and Computation 108(2), 212–261 (1994)
  11. Dani, V., Hayes, T.P.: Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 937–943. ACM Press, New York (2006)
  12. Hutter, M., Poland, J.: Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research 6, 639–660 (2005)
  13. Kakade, S., Kalai, A.T., Ligett, K.: Approximation algorithms going online. Technical Report CMU-CS-07-102, Carnegie Mellon University (2007), reports-archive.adm.cs.cmu.edu/anon/2007/CMU-CS-07-102.pdf
  14. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the 20th International Conference on Machine Learning, pp. 928–936. AAAI Press, Menlo Park (2003)
  15. Awerbuch, B., Kleinberg, R.: Near-optimal adaptive routing: Shortest paths and geometric generalizations. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pp. 45–53. ACM Press, New York (2004)
  16. McMahan, H.B., Blum, A.: Geometric optimization in the bandit setting against an adaptive adversary. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp. 109–123. Springer, Heidelberg (2004)
  17. György, A., Linder, T., Lugosi, G.: Tracking the best of many experts. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 204–216. Springer, Heidelberg (2005)
  18. McDiarmid, C.: Concentration. In: Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., Reed, B. (eds.) Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195–248. Springer, Heidelberg (1998)
  19. Vermorel, J., Mohri, M.: Multi-armed bandit algorithms and empirical evaluation. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 437–448. Springer, Heidelberg (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jussi Kujala (1)
  • Tapio Elomaa (1)

  1. Institute of Software Systems, Tampere University of Technology, P. O. Box 553, FI-33101 Tampere, Finland
