Abstract
This chapter considers real-world applications of reinforcement learning in finance, along with further advances in the theory introduced in the previous chapter. We begin with one of the most common problems of quantitative finance: optimal portfolio trading in discrete time. Many practical problems of trading or risk management amount to different forms of dynamic portfolio optimization, with different optimization criteria, portfolio compositions, and constraints. The chapter introduces a reinforcement learning approach to option pricing that generalizes the classical Black–Scholes model to a data-driven approach using Q-learning. It then presents a probabilistic extension of Q-learning called G-learning and shows how it can be used for dynamic portfolio optimization. For certain specifications of reward functions, G-learning is semi-analytically tractable and amounts to a probabilistic version of linear quadratic regulators (LQR). Detailed analyses of such cases are presented, together with their solutions, illustrated with examples from dynamic portfolio optimization and wealth management.
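To give a flavor of how G-learning softens standard Q-learning, the sketch below shows the free-energy (soft-max) aggregation of action values and the corresponding Boltzmann policy. This is only an illustrative toy in NumPy, not the chapter's exact formulation: the names `free_energy` and `g_learning_policy`, the uniform prior, and the inverse temperature `beta` are all assumptions for the example.

```python
import numpy as np

def free_energy(q_values, prior, beta):
    """Soft-max (free-energy) aggregation of action values:
    F = (1/beta) * log( sum_a prior(a) * exp(beta * Q(a)) ).
    Recovers max_a Q(a) in the limit beta -> infinity."""
    q_max = q_values.max()  # subtract max for numerical stability
    return q_max + np.log(np.dot(prior, np.exp(beta * (q_values - q_max)))) / beta

def g_learning_policy(q_values, prior, beta):
    """Boltzmann policy pi(a) proportional to prior(a) * exp(beta * Q(a))."""
    w = prior * np.exp(beta * (q_values - q_values.max()))
    return w / w.sum()

q = np.array([1.0, 2.0, 0.5])     # toy action values
prior = np.ones(3) / 3.0          # uniform reference policy
for beta in (0.1, 1.0, 10.0):
    pi = g_learning_policy(q, prior, beta)
    print(beta, free_energy(q, prior, beta), pi.round(3))
```

As `beta` grows, the policy concentrates on the greedy action and the free energy approaches the hard max used in ordinary Q-learning; as `beta` shrinks, the policy relaxes toward the prior.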
Notes
1. Here we use the notion of model-free learning in the same sense as it is normally used in the machine learning literature, namely as a method that does not rely on an explicit model of feature dynamics. Option prices and hedge ratios in the framework presented in this section depend on a model of rewards, and in this sense are model-dependent.
2. When transaction costs are neglected, taking u_T = 0 simply means converting all stock into cash. For more details on the choice u_T = 0, see Grau (2007).
3. If non-negative option prices are desired for arbitrary levels of risk aversion, the method developed below can be generalized by using non-quadratic utility functions instead of the quadratic Markowitz utility. This incurs a moderate computational overhead: a convex optimization problem is solved numerically at each time step, instead of a quadratic optimization that is solved semi-analytically.
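The per-step convex problem mentioned in this note can be sketched as follows. The example is hypothetical: the exponential utility, the toy linear model for option moves `dC`, the scenario sizes, and the function names are all assumptions chosen for illustration, not the book's specification.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# Simulated one-step scenarios: stock moves dS and option value moves dC
dS = rng.normal(0.0, 1.0, size=10_000)
dC = 0.5 * dS + 0.1 * rng.normal(size=10_000)   # toy model with "true" hedge ratio 0.5

def neg_exp_utility(a, risk_aversion=1.0):
    """Negative expected exponential utility of the hedged P&L dC - a*dS.
    Convex in the hedge ratio a, so a 1-d numerical solver suffices."""
    pnl = dC - a * dS
    return np.mean(np.exp(-risk_aversion * pnl))

res = minimize_scalar(neg_exp_utility, bounds=(-2.0, 2.0), method="bounded")
print("optimal hedge ratio:", res.x)  # close to 0.5 for this toy model
```

A quadratic utility would make this step solvable in closed form; the point of the note is that swapping in a general concave utility only replaces that closed-form step with a cheap one-dimensional (or low-dimensional) convex solve.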
4. The standard continuous-time BSM model is equivalent to using a risk-neutral pricing measure for option valuation. This approach only enables pure risk-based option hedging, which might be suitable for a hedger but not for an option speculator.
5. Note that with our definition of the value function in Eq. (10.32), it is not equal to a discounted sum of future rewards.
6. We assume no short sale positions in our setting, and therefore do not include borrowing costs.
7. Note that in physics, free energy is defined with a negative sign relative to Eq. (10.107). The difference is purely a matter of sign convention, as maximization of Eq. (10.107) can be restated as minimization of its negative. In using this sign convention for the free energy function, we follow the reinforcement learning and information theory literature.
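The two sign conventions can be compared schematically as follows (generic symbols, not tied to the exact form of Eq. (10.107)):

```latex
% RL / information theory convention: maximize
F_{\mathrm{RL}}[\pi] \;=\; \mathbb{E}_{\pi}[r] \;+\; \tfrac{1}{\beta}\,\mathcal{H}[\pi]
% Physics convention: minimize
F_{\mathrm{phys}} \;=\; U - T S, \qquad T = \tfrac{1}{\beta}
% Identifying the energy with the negative expected reward, U = -\mathbb{E}_{\pi}[r],
% and the entropy S = \mathcal{H}[\pi], gives F_{\mathrm{phys}} = -F_{\mathrm{RL}}.
```

Maximizing the RL free energy is thus the same computation as minimizing the physicists' free energy of the corresponding system.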
8. Because actions in the present formulation are constrained by the self-financing condition, an independent Gaussian integration may produce inaccurate results. For a constrained version of the integral, with a constraint on the sum of the variables, see Exercise 10.6. In the next section, we present a case where unconstrained Gaussian integration works better.
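A standard way to handle a Gaussian integral with a constraint on the sum of the variables (a generic sketch, not the specific computation of Exercise 10.6) is to enforce the constraint with a Fourier representation of the delta function, which reduces the problem to unconstrained Gaussian integrals in one extra variable:

```latex
\int d^{n}u \;\delta\!\Big(\sum_{i} u_i - c\Big)\,
  e^{-\frac{1}{2} u^{\top} A u + b^{\top} u}
\;=\;
\int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, e^{-i\omega c}
  \int d^{n}u\; e^{-\frac{1}{2} u^{\top} A u + (b + i\omega \mathbf{1})^{\top} u}
```

Both the inner integral over $u$ and the remaining integral over $\omega$ are Gaussian, so the constrained integral is available in closed form in terms of $A^{-1}$, $b$, $\mathbf{1}$, and $c$.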
9. Or, if we want to place additional constraints on the resulting cash flows, to an optimization with one constraint, instead of two constraints as in the Merton approach.
References
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654.
Boyd, S., Busseti, E., Diamond, S., Kahn, R., Koh, K., Nystrup, P., et al. (2017). Multi-period trading via convex optimization. Foundations and Trends in Optimization, 3(1), 1–76.
Browne, S. (1996). Reaching goals by a deadline: digital options and continuous-time active portfolio management. https://www0.gsb.columbia.edu/mygsb/faculty/research/pubfiles/841/sidbrowne_deadlines.pdf.
Carr, P., Ellis, K., & Gupta, V. (1998). Static hedging of exotic options. Journal of Finance, 53(3), 1165–1190.
Černý, A., & Kallsen, J. (2007). Hedging by sequential regression revisited. Working paper, City University London and TU München.
Cheung, K. C., & Yang, H. (2007). Optimal investment-consumption strategy in a discrete-time model with regime switching. Discrete and Continuous Dynamical Systems, 8(2), 315–332.
Das, S. R., Ostrov, D., Radhakrishnan, A., & Srivastav, D. (2018). Dynamic portfolio allocation in goals-based wealth management. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3211951.
Duan, J. C., & Simonato, J. G. (2001). American option pricing under GARCH by a Markov chain approximation. Journal of Economic Dynamics and Control, 25, 1689–1718.
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.
Föllmer, H., & Schweizer, M. (1989). Hedging by sequential regression: An introduction to the mathematics of option trading. ASTIN Bulletin, 18, 147–160.
Fox, R., Pakman, A., & Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. In 32nd Conference on Uncertainty in Artificial Intelligence (UAI). https://arxiv.org/pdf/1512.08562.pdf.
Garleanu, N., & Pedersen, L. H. (2013). Dynamic trading with predictable returns and transaction costs. Journal of Finance, 68(6), 2309–2340.
Gosavi, A. (2015). Finite horizon Markov control with one-step variance penalties. In Proceedings of the Allerton Conference, Allerton, IL.
Grau, A. J. (2007). Applications of least-square regressions to pricing and hedging of financial derivatives. PhD thesis, Technische Universität München.
Halperin, I. (2018). QLBS: Q-learner in the Black-Scholes(-Merton) worlds. Journal of Derivatives 2020, (to be published). Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3087076.
Halperin, I. (2019). The QLBS Q-learner goes NuQLear: Fitted Q iteration, inverse RL, and option portfolios. Quantitative Finance, 19(9). https://doi.org/10.1080/14697688.2019.1622302, available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102707.
Halperin, I., & Feldshteyn, I. (2018). Market self-learning of signals, impact and optimal trading: invisible hand inference with free energy, (or, how we learned to stop worrying and love bounded rationality). https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3174498.
Lin, C., Zeng, L., & Wu, H. (2019). Multi-period portfolio optimization in a defined contribution pension plan during the decumulation phase. Journal of Industrial and Management Optimization, 15(1), 401–427. https://doi.org/10.3934/jimo.2018059.
Longstaff, F. A., & Schwartz, E. S. (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1), 113–147.
Markowitz, H. (1959). Portfolio selection: efficient diversification of investments. John Wiley.
Marschinski, R., Rossi, P., Tavoni, M., & Cocco, F. (2007). Portfolio selection with probabilistic utility. Annals of Operations Research, 151(1), 223–239.
Merton, R. C. (1971). Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3(4), 373–413.
Merton, R. C. (1973). Theory of rational option pricing. Bell Journal of Economics and Management Science, 4(1), 141–183.
Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073–1097.
Ortega, P. A., & Lee, D. D. (2014). An adversarial interpretation of information-theoretic bounded rationality. In Proceedings of the Twenty-Eighth AAAI Conference on AI. https://arxiv.org/abs/1404.5668.
Petrelli, A., Balachandran, R., Siu, O., Chatterjee, R., Jun, Z., & Kapoor, V. (2010). Optimal dynamic hedging of equity options: residual-risks transaction-costs. Working paper.
Potters, M., Bouchaud, J., & Sestovic, D. (2001). Hedged Monte Carlo: low variance derivative pricing with objective probabilities. Physica A, 289, 517–525.
Sato, Y. (2019). Model-free reinforcement learning for financial portfolios: a brief survey. https://arxiv.org/pdf/1904.04973.pdf.
Schweizer, M. (1995). Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20, 1–32.
Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceeding of the American Control Conference, Portland OR, USA, pp. 300–306.
van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/3964-double-q-learning.pdf.
Watkins, C. J. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge, England.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Wilmott, P. (1998). Derivatives: the theory and practice of financial engineering. Wiley.
Appendix
Answers to Multiple Choice Questions
Question 1
Answer: 2, 3.
Question 2
Answer: 2, 4.
Question 3
Answer: 1, 2, 3.
Question 4
Answer: 2, 4.
Python Notebooks
This chapter is accompanied by two notebooks that implement the QLBS model for option pricing and optimal hedging, and G-learning for wealth management. Further details of the notebooks are included in the README.md file.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Dixon, M.F., Halperin, I., Bilokon, P. (2020). Applications of Reinforcement Learning. In: Machine Learning in Finance. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1_10
DOI: https://doi.org/10.1007/978-3-030-41068-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41067-4
Online ISBN: 978-3-030-41068-1
eBook Packages: Mathematics and Statistics (R0)