General Discounting Versus Average Reward

  • Marcus Hutter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4264)

Abstract

Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U from cycle 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that U (as m→∞) and V (as k→∞) are asymptotically equal, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then existence of the limit of U implies existence of the limit of V. Conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies existence of the limit of U.
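
One natural formalization of the two values, consistent with the abstract (a sketch: the normalization by Γ_k and the factor 1/2 in the horizon definition are assumptions here, not quoted from the paper). For a bounded reward sequence r_1, r_2, … and a summable discount sequence γ_1, γ_2, … > 0:

\[
U_{1m} \;=\; \frac{1}{m}\sum_{i=1}^{m} r_i ,
\qquad
V_{k\gamma} \;=\; \frac{1}{\Gamma_k}\sum_{i=k}^{\infty} \gamma_i\, r_i ,
\qquad
\Gamma_k \;=\; \sum_{i=k}^{\infty} \gamma_i ,
\]

with the effective horizon taken as

\[
h_k \;=\; \min\{\, h \ge 1 \;:\; \Gamma_{k+h} \le \tfrac{1}{2}\,\Gamma_k \,\}.
\]

Under these definitions, geometric discounting γ_i = γ^i gives a constant effective horizon h_k ≈ ln 2 / ln(1/γ), whereas quadratic discounting γ_i = 1/i² gives Γ_k ≈ 1/k and hence h_k ≈ k — the linear-growth boundary case at which both implications in the abstract apply.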

Keywords

Average Reward · Bandit Problem · Discount Reward · Reward Sequence · Inconsistent Policy

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Marcus Hutter
    1. IDSIA / RSISE / ANU / NICTA
