
General Discounting Versus Average Reward

  • Conference paper
Algorithmic Learning Theory (ALT 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4264)

Abstract

Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U from cycle 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that asymptotically U for m→∞ and V for k→∞ are equal, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then the existence of the limit of U implies that the limit of V exists. Conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies that the limit of U exists.
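
The quantities compared above can be made concrete with a small numerical experiment. The following sketch is not taken from the paper; it assumes the standard definitions U(1..m) = (1/m) Σ_{i=1}^m r_i for the average value and V(k..∞) = (Σ_{i≥k} γ_i r_i) / (Σ_{i≥k} γ_i) for the normalized discounted value, together with an illustrative bounded reward sequence and an illustrative non-geometric discount sequence γ_i = 1/i², and shows both values approaching the same limit as m and k grow.

import numpy as np

# Hedged illustration only: the reward and discount sequences below are
# assumptions chosen for demonstration, not taken from the paper.
T = 200_000                         # finite truncation of the infinite horizon
i = np.arange(1, T + 1)

# Bounded reward sequence whose running average converges (here to 0.5):
r = 0.5 + 0.4 * np.sin(np.log(i)) / np.log(i + 1)

# Non-geometric, summable discount sequence gamma_i = 1/i^2:
gamma = 1.0 / i.astype(float) ** 2

def average_value(m):
    """U from cycle 1 to m: plain average of the first m rewards."""
    return float(r[:m].mean())

def discounted_value(k):
    """V from cycle k onward: discounted sum of future rewards,
    normalized by the remaining discount mass."""
    tail = slice(k - 1, None)
    return float(np.dot(gamma[tail], r[tail]) / gamma[tail].sum())

for n in (10, 1_000, 100_000):
    print(f"U(1..{n}) = {average_value(n):.4f}   V({n}..inf) = {discounted_value(n):.4f}")

Running the sketch, both columns drift toward the same limit (0.5 for this reward sequence), illustrating the asymptotic agreement of average and discounted value stated in the abstract; it does not, of course, substitute for the effective-horizon conditions under which one limit's existence implies the other's.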

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hutter, M. (2006). General Discounting Versus Average Reward. In: Balcázar, J.L., Long, P.M., Stephan, F. (eds) Algorithmic Learning Theory. ALT 2006. Lecture Notes in Computer Science, vol 4264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11894841_21

  • DOI: https://doi.org/10.1007/11894841_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46649-9

  • Online ISBN: 978-3-540-46650-5

  • eBook Packages: Computer Science (R0)
