
A polynomial time bound for Howard's policy improvement algorithm


Summary

We consider a discounted Markovian Decision Process (MDP) with finite state and action spaces. For a fixed discount factor, we derive a bound on the number of steps taken by Howard's policy improvement algorithm (PIA) to determine an optimal policy for the MDP; this bound is essentially polynomial in the number of states and actions of the MDP. The main tools are the contraction properties of the PIA and a lower bound on the difference of the value functions of an MDP with rational data.
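To make the object of study concrete, here is a minimal sketch of Howard's policy improvement algorithm for a finite discounted MDP, written in Python with NumPy. It is an illustration only, not the paper's analysis: the array layout, the function name `policy_iteration`, the initial policy, and the toy data at the end are assumptions of this sketch.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Sketch of Howard's policy improvement algorithm (PIA).

    P     : transition probabilities, shape (S, A, S); P[s, a, t] = Pr(t | s, a)
    r     : one-step rewards, shape (S, A)
    gamma : discount factor, 0 <= gamma < 1
    Returns an optimal deterministic policy (shape (S,)) and its value function.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(S), policy]          # (S, S) transitions under the policy
        r_pi = r[np.arange(S), policy]          # (S,) rewards under the policy
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = r + gamma * P @ v                   # (S, A) action values
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # no improving action anywhere
            return policy, v                    # current policy is optimal
        policy = new_policy

# Usage on a toy 2-state, 2-action MDP (made-up data):
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, r, gamma=0.9))
```

Each improvement step strictly increases the value function of the current policy until a fixed point is reached, which is why the number of iterations, rather than the cost of a single policy-evaluation solve, is the quantity the paper bounds.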





Cite this article

Meister, U., Holzbaur, U. A polynomial time bound for Howard's policy improvement algorithm. OR Spektrum 8, 37–40 (1986). https://doi.org/10.1007/BF01720771

