
On Following the Perturbed Leader in the Bandit Setting

  • Jussi Kujala
  • Tapio Elomaa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3734)

Abstract

In an online decision problem an algorithm is required at each time step to choose one of the feasible points without knowing the cost associated with it. An adversary assigns the costs to the possible decisions either obliviously or adaptively. The online algorithm, naturally, attempts to incur as little cost as possible. The difference between the cost of the online algorithm and that of the best static decision in hindsight is called the regret of the algorithm.
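For the linear-cost setting considered here, the regret over T rounds can be written in standard notation (the symbols below are illustrative rather than the paper's exact ones) as
\[ R_T \;=\; \sum_{t=1}^{T} c_t \cdot x_t \;-\; \min_{x \in \mathcal{D}} \sum_{t=1}^{T} c_t \cdot x, \]
where \(\mathcal{D}\) is the set of feasible decision vectors, \(x_t \in \mathcal{D}\) is the decision chosen at round t, and \(c_t\) is the cost vector assigned by the adversary.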

Kalai and Vempala [1] showed that it is possible to have efficient solutions to some problems with a linear cost function by following the perturbed leader. Their solution requires the costs of all decisions to be known. Recently there has also been some progress in the bandit setting, where only the cost of the selected decision is observed. A bound of \(O(T^{2/3})\) on the regret over T rounds against an oblivious adversary was first shown by Awerbuch and Kleinberg [2], and later McMahan and Blum [3] showed that a bound of \(O(\sqrt{\ln T}\,T^{3/4})\) is obtainable against an adaptive adversary.
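As a rough illustration of the full-information follow-the-perturbed-leader idea, the sketch below picks, over a finite set of decision vectors, the decision minimizing the perturbed cumulative linear cost. It is a minimal sketch only; the function name and the uniform perturbation scale are illustrative choices, not the exact construction of [1].

    import numpy as np

    def fpl_choose(decisions, cumulative_cost, epsilon, rng):
        # decisions: (n, d) array of feasible decision vectors.
        # cumulative_cost: sum of the cost vectors observed so far (length d).
        # epsilon: trade-off parameter; larger epsilon means a smaller perturbation.
        d = decisions.shape[1]
        # Draw a random perturbation and follow the perturbed leader:
        # minimize the perturbed cumulative linear cost.
        perturbation = rng.uniform(0.0, 1.0 / epsilon, size=d)
        scores = decisions @ (cumulative_cost + perturbation)
        return int(np.argmin(scores))

In the full-information setting the true cost vector is added to cumulative_cost after every round; in the bandit setting only the chosen decision's cost is revealed, so an estimate of the cost vector must be used instead.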

In this paper we study Kalai and Vempala’s model from the viewpoint of bandit algorithms. We show that the algorithm of McMahan and Blum attains a regret of \(O(T^{2/3})\) against an oblivious adversary. Moreover, we show a tighter \(O(\sqrt{m\ln m}\,\sqrt{T})\) bound for the expert setting using m experts.
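For intuition, the following sketch shows the explore/exploit scheme over m experts with bandit feedback: with some probability the algorithm explores to build an importance-weighted estimate of the costs, and otherwise it exploits by following the perturbed leader on those estimates. The exploration rate, perturbation scale, and estimator below are illustrative choices, not the parameterization that yields the bounds stated above.

    import numpy as np

    def bandit_fpl_experts(cost_of, m, T, epsilon, gamma, seed=0):
        # cost_of(t, i): cost of expert i at round t; only the chosen expert's
        # cost is ever queried, matching the bandit feedback model.
        rng = np.random.default_rng(seed)
        est_cum_cost = np.zeros(m)   # importance-weighted cost estimates
        total_cost = 0.0
        for t in range(T):
            if rng.random() < gamma:
                # Exploration step: sample a uniformly random expert and add an
                # unbiased importance-weighted estimate of its cost.
                i = int(rng.integers(m))
                c = cost_of(t, i)
                est_cum_cost[i] += c * m / gamma
            else:
                # Exploitation step: follow the perturbed leader on the estimates.
                perturbation = rng.uniform(0.0, 1.0 / epsilon, size=m)
                i = int(np.argmin(est_cum_cost + perturbation))
                c = cost_of(t, i)
            total_cost += c
        return total_cost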

Keywords

Online Algorithm, Decision Vector, Cost Vector, Perturbation Vector, Exploitation Step


References

  1. Kalai, A., Vempala, S.: Efficient algorithms for online decision problems. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 26–40. Springer, Heidelberg (2003)
  2. Awerbuch, B., Kleinberg, R.D.: Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 45–53. ACM Press, New York (2004)
  3. McMahan, H.B., Blum, A.: Geometric optimization in the bandit setting against an adaptive adversary. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp. 109–123. Springer, Heidelberg (2004)
  4. Sleator, D., Tarjan, R.: Self-adjusting binary search trees. Journal of the ACM 32, 652–686 (1985)
  5. Hannan, J.: Approximation to Bayes risk in repeated plays. In: Dresher, M., Tucker, A., Wolfe, P. (eds.) Contributions to the Theory of Games, vol. 3, pp. 97–139. Princeton University Press, Princeton (1957)
  6. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the Twentieth International Conference on Machine Learning, pp. 928–936. AAAI Press, Menlo Park (2003)
  7. Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R.E., Warmuth, M.K.: How to use expert advice. Journal of the ACM 44, 427–485 (1997)
  8. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information and Computation 108, 212–261 (1994)
  9. Takimoto, E., Warmuth, M.K.: Path kernels and multiplicative updates. Journal of Machine Learning Research 4, 773–818 (2003)
  10. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The non-stochastic multi-armed bandit problem. SIAM Journal on Computing 32, 48–77 (2002)
  11. Flaxman, A.D., Kalai, A.T., McMahan, H.B.: Online convex optimization in the bandit setting: Gradient descent without a gradient. In: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394. ACM Press, New York (2005)
  12. Hutter, M., Poland, J.: Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research 6, 639–660 (2005)
  13. Moore, E.H.: On the reciprocal of the general algebraic matrix (abstract). Bulletin of the American Mathematical Society 26, 394–395 (1920)
  14. Penrose, R.A.: A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society 51, 406–413 (1955)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jussi Kujala (1)
  • Tapio Elomaa (1)

  1. Institute of Software Systems, Tampere University of Technology, Tampere, Finland
