Policy Learning for Time-Bounded Reachability in Continuous-Time Markov Decision Processes via Doubly-Stochastic Gradient Ascent

  • Conference paper
Quantitative Evaluation of Systems (QEST 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9826)

Abstract

Continuous-time Markov decision processes are an important class of models in a wide range of applications, from cyber-physical systems to synthetic biology. A central problem is how to devise a policy to control the system so as to maximise the probability of satisfying a set of temporal logic specifications. Here we present a novel approach based on statistical model checking and an unbiased estimation of a functional gradient in the space of possible policies. The statistical approach has several advantages over conventional approaches based on uniformisation: it can also be applied when the model is available only as a black box, and it does not suffer from state-space explosion. The use of a stochastic gradient to guide our search considerably improves the efficiency of policy learning. We demonstrate the method on a proof-of-principle non-linear population model, showing strong performance on a non-trivial task.
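The abstract names the ingredients of the approach (statistical model checking, i.e. Monte Carlo estimation of the reachability probability, combined with an unbiased gradient estimate in policy space) but the algorithmic details are in the paper itself. Purely as an illustration of the general idea, and not the authors' doubly-stochastic functional-gradient method, the sketch below runs policy-gradient ascent on a hypothetical toy birth-death CTMDP: trajectories are simulated Gillespie-style, the time-bounded reachability probability is estimated as the fraction of runs that hit the goal state, and the policy parameters receive an unbiased score-function (likelihood-ratio) gradient update. All names, the model, and its dynamics are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy birth-death CTMDP (hypothetical): states 0..5, two actions;
# the objective is to reach state GOAL before time T_HORIZON.
N_STATES, N_ACTIONS = 6, 2
GOAL, T_HORIZON = 5, 10.0

def rates(state, action):
    """Transition rates of the toy model; action 1 boosts the birth rate."""
    birth = 1.0 + 0.8 * action
    death = 0.6 * state
    return birth, death

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def simulate(theta):
    """One Gillespie-style trajectory under the softmax policy theta.
    Returns (1 if goal reached within [0, T], else 0; score term)."""
    s, t = 0, 0.0
    score = np.zeros_like(theta)
    while t < T_HORIZON:
        if s == GOAL:
            return 1.0, score
        probs = softmax(theta[s])
        a = rng.choice(N_ACTIONS, p=probs)
        grad = -probs
        grad[a] += 1.0
        score[s] += grad                    # d/dtheta log pi(a | s)
        birth, death = rates(s, a)
        total = birth + death
        t += rng.exponential(1.0 / total)   # exponential holding time
        if t >= T_HORIZON:
            break
        s = s + 1 if rng.random() < birth / total else max(s - 1, 0)
    return float(s == GOAL), score

def ascent_step(theta, lr=0.5, batch=200):
    """Score-function (likelihood-ratio) gradient step over a batch."""
    grad, hits = np.zeros_like(theta), 0.0
    for _ in range(batch):
        r, score = simulate(theta)
        hits += r
        grad += r * score                   # unbiased estimate of grad P(reach)
    theta += lr * grad / batch
    return hits / batch                     # Monte Carlo reachability estimate

theta = np.zeros((N_STATES, N_ACTIONS))
for _ in range(50):
    p_hat = ascent_step(theta)
print(f"estimated time-bounded reachability: {p_hat:.3f}")
```

The estimate is unbiased because the holding times and jump directions depend on the parameters only through the sampled actions, so the likelihood-ratio identity applies; this mirrors the "statistical" advantage claimed in the abstract, since the simulator is used purely as a black box.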

Notes

  1. Kernel functions typically also have an amplitude parameter, which we consider to be equal to 1; a standard example is sketched below.
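The footnote leaves the kernel itself unspecified on this page. For orientation only, a common choice in this setting is the squared-exponential (Gaussian) kernel, whose amplitude \(\sigma^2\) is the parameter fixed to 1 here, leaving the lengthscale \(\ell\) as the free hyperparameter; the paper's actual kernel may differ.

```latex
% Squared-exponential kernel; a common choice, shown only for illustration.
% The amplitude sigma^2 is the parameter the footnote fixes to 1.
k(x, x') \;=\; \sigma^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right),
\qquad \sigma^2 = 1 .
```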

Acknowledgements

L.B. acknowledges partial support from the EU-FET project QUANTICOL (nr. 600708) and from FRA-UniTS. G.S. and D.M. acknowledge support from the European Research Council under grant MLCS306999. T.B. is supported by the Czech Science Foundation, grant No. 15-17564S. E.B. acknowledges the partial support of the Austrian National Research Network S 11405-N23 (RiSE/SHiNE) of the Austrian Science Fund (FWF), the ICT COST Action IC1402 Runtime Verification beyond Monitoring (ARVI), and the IKT der Zukunft programme of the Austrian FFG, project HARMONIA (nr. 845631).

Author information

Corresponding author

Correspondence to Dimitrios Milios.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bartocci, E., Bortolussi, L., Brázdil, T., Milios, D., Sanguinetti, G. (2016). Policy Learning for Time-Bounded Reachability in Continuous-Time Markov Decision Processes via Doubly-Stochastic Gradient Ascent. In: Agha, G., Van Houdt, B. (eds) Quantitative Evaluation of Systems. QEST 2016. Lecture Notes in Computer Science, vol. 9826. Springer, Cham. https://doi.org/10.1007/978-3-319-43425-4_17

  • DOI: https://doi.org/10.1007/978-3-319-43425-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43424-7

  • Online ISBN: 978-3-319-43425-4

  • eBook Packages: Computer Science, Computer Science (R0)
