
A polynomial time bound for Howard's policy improvement algorithm


Summary

We consider a discounted Markovian Decision Process (MDP) with finite state and action spaces. For a fixed discount factor, we derive a bound on the number of steps taken by Howard's policy improvement algorithm (PIA) to determine an optimal policy for the MDP; this bound is essentially polynomial in the number of states and actions of the MDP. The main tools are the contraction properties of the PIA and a lower bound on the difference of the value functions of an MDP with rational data.
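To make the object of study concrete, here is a minimal sketch of Howard's policy improvement algorithm for a finite discounted MDP, written in Python with NumPy. It is an illustration only, not the paper's analysis: the array layout, the function name `policy_iteration`, the initial policy, and the toy data at the end are assumptions of this sketch.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Sketch of Howard's policy improvement algorithm (PIA).

    P     : transition probabilities, shape (S, A, S); P[s, a, t] = Pr(t | s, a)
    r     : one-step rewards, shape (S, A)
    gamma : discount factor, 0 <= gamma < 1
    Returns an optimal deterministic policy (shape (S,)) and its value function.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(S), policy]          # (S, S) transitions under the policy
        r_pi = r[np.arange(S), policy]          # (S,) rewards under the policy
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = r + gamma * P @ v                   # (S, A) action values
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # no improving action anywhere
            return policy, v                    # current policy is optimal
        policy = new_policy

# Usage on a toy 2-state, 2-action MDP (made-up data):
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, r, gamma=0.9))
```

Each improvement step strictly increases the value function of the current policy until a fixed point is reached, which is why the number of iterations, rather than the cost of a single policy-evaluation solve, is the quantity the paper bounds.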





Cite this article

Meister, U., Holzbaur, U. A polynomial time bound for Howard's policy improvement algorithm. OR Spektrum 8, 37–40 (1986). https://doi.org/10.1007/BF01720771

