General Discounting Versus Average Reward

Hutter, Marcus

doi:10.1007/11894841_21

Marcus Hutter²¹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4264))

Included in the following conference series:

International Conference on Algorithmic Learning Theory

774 Accesses
3 Citations

Abstract

Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U from cycle 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that asymptotically U for m→∞ and V for k→∞ are equal, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then the existence of the limit of U implies that the limit of V exists. Conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies that the limit of U exists.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Avrachenkov, K.E., Altman, E.: Sensitive discount optimality via nested linear programs for ergodic Markov decision processes. In: Proceedings of Information Decision and Control 1999, Adelaide, Australia, pp. 53–58. IEEE, Los Alamitos (1999)
Google Scholar
Berry, D.A., Fristedt, B.: Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London (1985)
MATH Google Scholar
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. In: Athena Scientific, Belmont, MA (1996)
Google Scholar
Frederick, S., Loewenstein, G., O’Donoghue, T.: Time discounting and time preference: A critical review. Journal of Economic Literature 40, 351–401 (2002)
Article Google Scholar
Hutter, M.: Self-optimizing and pareto-optimal policies in general environments based on bayes-mixtures. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 364–379. Springer, Heidelberg (2002)
Chapter Google Scholar
Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability, p. 300. Springer, Berlin (2005), http://www.idsia.ch/~marcus/ai/uaibook.htm
MATH Google Scholar
Kakade, S.: Optimizing average reward using discounted rewards. In: Helmbold, D.P., Williamson, B. (eds.) COLT 2001 and EuroCOLT 2001. LNCS (LNAI), vol. 2111, pp. 605–615. Springer, Heidelberg (2001)
Chapter Google Scholar
Kelly, F.P.: Multi-armed bandits with discount factor near one: The Bernoulli case. Annals of Statistics 9, 987–1001 (1981)
Article MATH MathSciNet Google Scholar
Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, 237–285 (1996)
Google Scholar
Kumar, P.R., Varaiya, P.P.: Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, NJ (1986)
MATH Google Scholar
Mahadevan, S.: Sensitive discount optimality: Unifying discounted and average reward reinforcement learning. In: Proc. 13th International Conference on Machine Learning, pp. 328–336. Morgan Kaufmann, San Francisco (1996)
Google Scholar
Russell, S.J., Norvig, P.: Artificial Intelligence. A Modern Approach, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (2003)
Google Scholar
Samuelson, P.: A note on measurement of utility. Review of Economic Studies 4, 155–161 (1937)
Article Google Scholar
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
Google Scholar
Strotz, R.H.: Myopia and inconsistency in dynamic utility maximization. Review of Economic Studies 23, 165–180 (1955–1956)
Google Scholar
Vieille, N., Weibull, J.W.: Dynamic optimization with non-exponential discounting: On the uniqueness of solutions. Technical Report WP No. 577, Department of Economics, Boston Univeristy, Boston, MA (2004)
Google Scholar

Download references

Author information

Authors and Affiliations

IDSIA / RSISE / ANU / NICTA /,
Marcus Hutter

Authors

Marcus Hutter
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Departament de Llenguatges i Sistemes Informàtics Laboratori d’Algorísmica Relacional, Complexitat i Aprenentatge, Universitat Politècnica de Catalunya, Barcelona,
José L. Balcázar
Google, 1600 Amphitheatre Parkway, 94043, Mountain View, CA, USA
Philip M. Long
Department of Computer Science and Department of Mathematics, National University of Singapore, 117543, Singapore, Republic of Singapore
Frank Stephan

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Hutter, M. (2006). General Discounting Versus Average Reward. In: Balcázar, J.L., Long, P.M., Stephan, F. (eds) Algorithmic Learning Theory. ALT 2006. Lecture Notes in Computer Science(), vol 4264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11894841_21

Download citation

DOI: https://doi.org/10.1007/11894841_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46649-9
Online ISBN: 978-3-540-46650-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics