General Discounting Versus Average Reward
Consider an agent interacting with an environment in cycles, receiving a reward for its performance in each cycle. We compare the average reward U over cycles 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We allow essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that U for m→∞ and V for k→∞ converge to the same limit, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then existence of the limit of U implies that the limit of V exists; conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies that the limit of U exists.
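The claimed asymptotic equality can be illustrated numerically. The sketch below assumes the natural normalizations U_m = (1/m) Σ_{i=1}^m r_i and V_k = Σ_{i≥k} γ_i r_i / Σ_{i≥k} γ_i, with the quadratic discount γ_i = 1/i² (whose effective horizon grows linearly with k); the reward sequence r_i = 1 − 1/i and all function names are illustrative choices, not from the paper.

```python
# Illustrative sketch (assumed notation): average value U_m versus
# normalized discounted value V_k, with discount gamma_i = 1/i^2.

def average_value(r, m):
    """U_m: average reward over cycles 1..m."""
    return sum(r(i) for i in range(1, m + 1)) / m

def discounted_value(r, k, horizon=200_000):
    """V_k: discounted reward from cycle k on, normalized by the
    remaining discount mass (sum truncated after `horizon` terms)."""
    num = sum(r(i) / i**2 for i in range(k, k + horizon))
    den = sum(1.0 / i**2 for i in range(k, k + horizon))
    return num / den

r = lambda i: 1.0 - 1.0 / i  # example reward sequence with limit 1

print(average_value(r, 10_000))     # approaches 1 as m grows
print(discounted_value(r, 10_000))  # approaches the same limit as k grows
```

For this convergent reward sequence both values approach the common limit 1, matching the equality asserted in the abstract; the interesting cases in the paper are rewards whose averages oscillate, where only one of the two limits may exist.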
Keywords: Average Reward · Bandit Problem · Discounted Reward · Reward Sequence · Inconsistent Policy