
Approximate Indexability and Bandit Problems with Concave Rewards and Delayed Feedback

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8096)

Abstract

We consider two stochastic multi-armed bandit problems in the Bayesian setting. In the first problem, the reward accrued in a step is a concave function (such as the maximum) of the observed values of the arms played in that step. In the second problem, the value observed from a play of arm i is revealed only after δ_i steps. Both problems have been considered in the bandit literature, but no solutions with provably good performance guarantees over short horizons are known. The two problems are similar in that the reward (for the first) or the available information (for the second) derived from an arm is not a function of only the current play of that arm. This interdependence between arms renders most existing analysis techniques inapplicable.
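The two reward structures described above can be illustrated with a toy simulation. This is not from the paper; the arm means and delays are made-up numbers, and `pull` is a hypothetical Bernoulli observation model used only to make the definitions concrete.

```python
import random

random.seed(0)

# Made-up 3-armed Bernoulli instance (illustration only).
means = [0.3, 0.5, 0.7]
delays = [0, 2, 5]           # delta_i: steps until arm i's value is revealed

def pull(i):
    """Observe one Bernoulli sample from arm i."""
    return 1 if random.random() < means[i] else 0

# Problem 1 (concave rewards): play a subset of arms in one step and
# accrue a concave function -- here the maximum -- of the observed values.
plays = [0, 2]
reward = max(pull(i) for i in plays)

# Problem 2 (delayed feedback): a value pulled at time t from arm i is
# revealed only at time t + delta_i; a policy at time t may condition
# only on the already-revealed observations.
pending = []                 # (reveal_time, arm, value)
for t in range(10):
    i = t % 3
    pending.append((t + delays[i], i, pull(i)))
    revealed = [(a, v) for (r, a, v) in pending if r <= t]
```

In both variants, what an arm contributes at time t depends on more than its current play, which is exactly the interdependence the abstract points to.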

A fundamental question regarding optimization in the bandit setting is indexability, i.e., the existence of near-optimal index policies. Index policies in these contexts correspond to policies over suitable single-arm state spaces that are combined into a global policy. They are extremely desirable for their simplicity and perceived robustness, but standard index policies provide no guarantees for these two problems. We construct O(1)-approximate (near) index policies in polynomial time for both problems. The analysis identifies, for each arm, a suitable subset of states such that index policies restricted to those subsets are O(1)-approximate.
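The generic index-policy pattern referred to above can be sketched as follows. This is not the paper's construction (which restricts attention to a suitable subset of each arm's states); it only shows the structural idea that each arm carries its own state and a score computed from that state alone, with the global policy playing the arm of highest score. The UCB-style score is a placeholder assumption.

```python
import math

class Arm:
    """Single-arm state: Bernoulli play/success counts (illustrative)."""
    def __init__(self):
        self.successes = 0
        self.plays = 0

    def index(self, t):
        # Placeholder UCB-style index; a function of this arm's state only.
        if self.plays == 0:
            return float("inf")      # force an initial play of every arm
        mean = self.successes / self.plays
        return mean + math.sqrt(2 * math.log(t + 1) / self.plays)

    def update(self, value):
        self.plays += 1
        self.successes += value

def step(arms, t, observe):
    """Global policy: play the arm whose per-arm index is largest."""
    i = max(range(len(arms)), key=lambda j: arms[j].index(t))
    arms[i].update(observe(i))
    return i
```

The appeal of this decomposition is that the global policy never reasons about joint states of all arms; the paper's contribution is showing that a variant of this structure remains O(1)-approximate even when rewards or feedback couple the arms.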

This paper subsumes the unpublished manuscript [23].



References

  1. Agarwal, D., Chen, B.-C., Elango, P.: Explore/exploit schemes for web content optimization. In: Proceedings of ICDM, pp. 1–10 (2009)

  2. Akella, A., Maggs, B., Seshan, S., Shaikh, A., Sitaraman, R.: A measurement-based analysis of multihoming. In: Proceedings of SIGCOMM, pp. 353–364 (2003)

  3. Alaei, S., Malekian, A.: Maximizing sequence-submodular functions and its application to online advertising. CoRR abs/1009.4153 (2010)

  4. Anderson, T.W.: Sequential analysis with delayed observations. Journal of the American Statistical Association 59(308), 1006–1015 (1964)

  5. Armitage, P.: The search for optimality in clinical trials. International Statistical Review 53(1), 15–24 (1985)

  6. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)

  7. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2003)

  8. Bertsekas, D.: Dynamic Programming and Optimal Control, 2nd edn. Athena Scientific (2001)

  9. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning and Games. Cambridge University Press (2006)

  10. Choi, S.C., Clark, V.A.: Sequential decision for a binomial parameter with delayed observations. Biometrics 26(3), 411–420 (1970)

  11. Demberel, A., Chase, J., Babu, S.: Reflective control for an elastic cloud application: An automated experiment workbench. In: Proc. of HotCloud (2009)

  12. Ehrenfeld, S.: On a scheduling problem in sequential analysis. The Annals of Mathematical Statistics 41(4), 1206–1216 (1970)

  13. Eick, S.G.: The two-armed bandit with delayed responses. The Annals of Statistics 16(1), 254–264 (1988)

  14. Even-Dar, E., Kearns, M., Wortman, J.: Risk-sensitive online learning. In: Balcázar, J.L., Long, P.M., Stephan, F. (eds.) ALT 2006. LNCS (LNAI), vol. 4264, pp. 199–213. Springer, Heidelberg (2006)

  15. Farias, V.F., Madan, R.: The irrevocable multiarmed bandit problem. Operations Research 59(2), 383–399 (2011)

  16. Gittins, J., Glazebrook, K., Weber, R.: Multi-Armed Bandit Allocation Indices. Wiley (2011)

  17. Gittins, J.C., Jones, D.M.: A dynamic allocation index for the sequential design of experiments. In: Progress in Statistics (European Meeting of Statisticians) (1972)

  18. Goel, A., Guha, S., Munagala, K.: How to probe for an extreme value. ACM Transactions on Algorithms 7(1), 12:1–12:20 (2010)

  19. Golovin, D., Krause, A.: Adaptive submodular optimization under matroid constraints. CoRR abs/1101.4450 (2011)

  20. Guha, S., Munagala, K.: Approximation algorithms for budgeted learning problems. In: Proc. of STOC (2007)

  21. Guha, S., Munagala, K.: Multi-armed bandits with metric switching costs. In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009, Part II. LNCS, vol. 5556, pp. 496–507. Springer, Heidelberg (2009)

  22. Guha, S., Munagala, K.: Approximation algorithms for Bayesian multi-armed bandit problems. CoRR, full version, also subsumes [20, 21] and [23] (2013)

  23. Guha, S., Munagala, K., Pál, M.: Iterated allocations with delayed feedback. Manuscript, available at CoRR (2010), http://arxiv.org/abs/1011.1161

  24. Gupta, A., Krishnaswamy, R., Molinaro, M., Ravi, R.: Approximation algorithms for correlated knapsacks and non-martingale bandits. In: Proc. of FOCS (2011)

  25. Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001)

  26. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985)

  27. Lee, J., Sviridenko, M., Vondrák, J.: Submodular maximization over multiple matroids via generalized exchange properties. Math. Oper. Res. 35(4), 795–806 (2010)

  28. Ny, J.L., Dahleh, M., Feron, E.: Multi-UAV dynamic routing with partial observations using restless bandits allocation indices. In: Proceedings of the 2008 American Control Conference (2008)

  29. Simon, R.: Adaptive treatment assignment methods and clinical trials. Biometrics 33(4), 743–749 (1977)

  30. Streeter, M.J., Golovin, D.: An online algorithm for maximizing submodular functions. In: NIPS, pp. 1577–1584 (2008)

  31. Streeter, M.J., Smith, S.F.: An asymptotically optimal algorithm for the max k-armed bandit problem. In: Proceedings of AAAI (2006)

  32. Suzuki, Y.: On sequential decision problems with delayed observations. Annals of the Institute of Statistical Mathematics 18(1), 229–267 (1966)

  33. Tian, F., DeWitt, D.J.: Tuple routing strategies for distributed eddies. In: Proc. of VLDB (2003)


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guha, S., Munagala, K. (2013). Approximate Indexability and Bandit Problems with Concave Rewards and Delayed Feedback. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX/RANDOM 2013. Lecture Notes in Computer Science, vol 8096. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40328-6_14

  • Print ISBN: 978-3-642-40327-9

  • Online ISBN: 978-3-642-40328-6
