Policy Learning for Time-Bounded Reachability in Continuous-Time Markov Decision Processes via Doubly-Stochastic Gradient Ascent

  • Conference paper
Quantitative Evaluation of Systems (QEST 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9826)

Abstract

Continuous-time Markov decision processes are an important class of models in a wide range of applications, from cyber-physical systems to synthetic biology. A central problem is how to devise a policy to control the system so as to maximise the probability of satisfying a set of temporal logic specifications. Here we present a novel approach based on statistical model checking and an unbiased estimation of a functional gradient in the space of possible policies. The statistical approach has several advantages over conventional approaches based on uniformisation: it can also be applied when the model is available only as a black box, and it does not suffer from state-space explosion. The use of a stochastic gradient to guide our search considerably improves the efficiency of policy learning. We demonstrate the method on a proof-of-principle non-linear population model, showing strong performance on a non-trivial task.
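The abstract names the ingredients of the approach (statistical model checking, i.e. Monte Carlo estimation of the reachability probability, combined with an unbiased gradient estimate in policy space) but the algorithmic details are in the paper itself. Purely as an illustration of the general idea, and not the authors' doubly-stochastic functional-gradient method, the sketch below runs policy-gradient ascent on a hypothetical toy birth-death CTMDP: trajectories are simulated Gillespie-style, the time-bounded reachability probability is estimated as the fraction of runs that hit the goal state, and the policy parameters receive an unbiased score-function (likelihood-ratio) gradient update. All names, the model, and its dynamics are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy birth-death CTMDP (hypothetical): states 0..5, two actions;
# the objective is to reach state GOAL before time T_HORIZON.
N_STATES, N_ACTIONS = 6, 2
GOAL, T_HORIZON = 5, 10.0

def rates(state, action):
    """Transition rates of the toy model; action 1 boosts the birth rate."""
    birth = 1.0 + 0.8 * action
    death = 0.6 * state
    return birth, death

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def simulate(theta):
    """One Gillespie-style trajectory under the softmax policy theta.
    Returns (1 if goal reached within [0, T], else 0; score term)."""
    s, t = 0, 0.0
    score = np.zeros_like(theta)
    while t < T_HORIZON:
        if s == GOAL:
            return 1.0, score
        probs = softmax(theta[s])
        a = rng.choice(N_ACTIONS, p=probs)
        grad = -probs
        grad[a] += 1.0
        score[s] += grad                    # d/dtheta log pi(a | s)
        birth, death = rates(s, a)
        total = birth + death
        t += rng.exponential(1.0 / total)   # exponential holding time
        if t >= T_HORIZON:
            break
        s = s + 1 if rng.random() < birth / total else max(s - 1, 0)
    return float(s == GOAL), score

def ascent_step(theta, lr=0.5, batch=200):
    """Score-function (likelihood-ratio) gradient step over a batch."""
    grad, hits = np.zeros_like(theta), 0.0
    for _ in range(batch):
        r, score = simulate(theta)
        hits += r
        grad += r * score                   # unbiased estimate of grad P(reach)
    theta += lr * grad / batch
    return hits / batch                     # Monte Carlo reachability estimate

theta = np.zeros((N_STATES, N_ACTIONS))
for _ in range(50):
    p_hat = ascent_step(theta)
print(f"estimated time-bounded reachability: {p_hat:.3f}")
```

The estimate is unbiased because the holding times and jump directions depend on the parameters only through the sampled actions, so the likelihood-ratio identity applies; this mirrors the "statistical" advantage claimed in the abstract, since the simulator is used purely as a black box.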

Notes

  1. Kernel functions typically also have an amplitude parameter, which we consider to be equal to 1; a standard example is sketched below.
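The footnote leaves the kernel itself unspecified on this page. For orientation only, a common choice in this setting is the squared-exponential (Gaussian) kernel, whose amplitude \(\sigma^2\) is the parameter fixed to 1 here, leaving the lengthscale \(\ell\) as the free hyperparameter; the paper's actual kernel may differ.

```latex
% Squared-exponential kernel; a common choice, shown only for illustration.
% The amplitude sigma^2 is the parameter the footnote fixes to 1.
k(x, x') \;=\; \sigma^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right),
\qquad \sigma^2 = 1 .
```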

Acknowledgements

L.B. acknowledges partial support from the EU-FET project QUANTICOL (nr. 600708) and from FRA-UniTS. G.S. and D.M. acknowledge support from the European Research Council under grant MLCS306999. T.B. is supported by the Czech Science Foundation, grant No. 15-17564S. E.B. acknowledges the partial support of the Austrian National Research Network S 11405-N23 (RiSE/SHiNE) of the Austrian Science Fund (FWF), the ICT COST Action IC1402 Runtime Verification beyond Monitoring (ARVI), and the IKT der Zukunft programme of the Austrian FFG, project HARMONIA (nr. 845631).

Author information

Corresponding author

Correspondence to Dimitrios Milios.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bartocci, E., Bortolussi, L., Brázdil, T., Milios, D., Sanguinetti, G. (2016). Policy Learning for Time-Bounded Reachability in Continuous-Time Markov Decision Processes via Doubly-Stochastic Gradient Ascent. In: Agha, G., Van Houdt, B. (eds) Quantitative Evaluation of Systems. QEST 2016. Lecture Notes in Computer Science, vol. 9826. Springer, Cham. https://doi.org/10.1007/978-3-319-43425-4_17

  • DOI: https://doi.org/10.1007/978-3-319-43425-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43424-7

  • Online ISBN: 978-3-319-43425-4

  • eBook Packages: Computer Science, Computer Science (R0)
