Abstract
This chapter considers real-world applications of reinforcement learning in finance, along with further advances in the theory introduced in the previous chapter. We begin with one of the most common problems of quantitative finance: optimal portfolio trading in discrete time. Many practical problems of trading or risk management amount to different forms of dynamic portfolio optimization, with different optimization criteria, portfolio compositions, and constraints. The chapter introduces a reinforcement learning approach to option pricing that generalizes the classical Black–Scholes model to a data-driven approach using Q-learning. It then presents a probabilistic extension of Q-learning called G-learning and shows how it can be used for dynamic portfolio optimization. For certain specifications of reward functions, G-learning is semi-analytically tractable and amounts to a probabilistic version of linear quadratic regulators (LQR). Detailed analyses of such cases are presented, together with their solutions, illustrated with examples from dynamic portfolio optimization and wealth management.
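To give a flavor of how G-learning softens standard Q-learning, the sketch below shows the free-energy (soft-max) aggregation of action values and the corresponding Boltzmann policy. This is only an illustrative toy in NumPy, not the chapter's exact formulation: the names `free_energy` and `g_learning_policy`, the uniform prior, and the inverse temperature `beta` are all assumptions for the example.

```python
import numpy as np

def free_energy(q_values, prior, beta):
    """Soft-max (free-energy) aggregation of action values:
    F = (1/beta) * log( sum_a prior(a) * exp(beta * Q(a)) ).
    Recovers max_a Q(a) in the limit beta -> infinity."""
    q_max = q_values.max()  # subtract max for numerical stability
    return q_max + np.log(np.dot(prior, np.exp(beta * (q_values - q_max)))) / beta

def g_learning_policy(q_values, prior, beta):
    """Boltzmann policy pi(a) proportional to prior(a) * exp(beta * Q(a))."""
    w = prior * np.exp(beta * (q_values - q_values.max()))
    return w / w.sum()

q = np.array([1.0, 2.0, 0.5])     # toy action values
prior = np.ones(3) / 3.0          # uniform reference policy
for beta in (0.1, 1.0, 10.0):
    pi = g_learning_policy(q, prior, beta)
    print(beta, free_energy(q, prior, beta), pi.round(3))
```

As `beta` grows, the policy concentrates on the greedy action and the free energy approaches the hard max used in ordinary Q-learning; as `beta` shrinks, the policy relaxes toward the prior.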
Notes
1. Here we use the notion of model-free learning in the same sense as it is normally used in the machine learning literature, namely as a method that does not rely on an explicit model of feature dynamics. Option prices and hedge ratios in the framework presented in this section depend on a model of rewards, and in this sense are model-dependent.
2. When transaction costs are neglected, taking u_T = 0 simply means converting all stock into cash. For more details on the choice u_T = 0, see Grau (2007).
3. If non-negative option prices are desired for arbitrary levels of risk aversion, the method developed below can be generalized by using non-quadratic utility functions instead of the quadratic Markowitz utility. This incurs a moderate computational overhead: a convex optimization problem is solved numerically at each time step, instead of a quadratic optimization that is solved semi-analytically.
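The per-step convex problem mentioned in this note can be sketched as follows. The example is hypothetical: the exponential utility, the toy linear model for option moves `dC`, the scenario sizes, and the function names are all assumptions chosen for illustration, not the book's specification.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# Simulated one-step scenarios: stock moves dS and option value moves dC
dS = rng.normal(0.0, 1.0, size=10_000)
dC = 0.5 * dS + 0.1 * rng.normal(size=10_000)   # toy model with "true" hedge ratio 0.5

def neg_exp_utility(a, risk_aversion=1.0):
    """Negative expected exponential utility of the hedged P&L dC - a*dS.
    Convex in the hedge ratio a, so a 1-d numerical solver suffices."""
    pnl = dC - a * dS
    return np.mean(np.exp(-risk_aversion * pnl))

res = minimize_scalar(neg_exp_utility, bounds=(-2.0, 2.0), method="bounded")
print("optimal hedge ratio:", res.x)  # close to 0.5 for this toy model
```

A quadratic utility would make this step solvable in closed form; the point of the note is that swapping in a general concave utility only replaces that closed-form step with a cheap one-dimensional (or low-dimensional) convex solve.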
4. The standard continuous-time BSM model is equivalent to using a risk-neutral pricing measure for option valuation. This approach only enables pure risk-based option hedging, which might be suitable for a hedger but not for an option speculator.
5. Note that with our definition of the value function in Eq. (10.32), it is not equal to a discounted sum of future rewards.
6. We assume no short sale positions in our setting, and therefore do not include borrowing costs.
7. Note that in physics, free energy is defined with a negative sign relative to Eq. (10.107). The difference is purely a matter of sign convention, as maximization of Eq. (10.107) can be restated as minimization of its negative. In using this sign convention for the free energy function, we follow the reinforcement learning and information theory literature.
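The two sign conventions can be compared schematically as follows (generic symbols, not tied to the exact form of Eq. (10.107)):

```latex
% RL / information theory convention: maximize
F_{\mathrm{RL}}[\pi] \;=\; \mathbb{E}_{\pi}[r] \;+\; \tfrac{1}{\beta}\,\mathcal{H}[\pi]
% Physics convention: minimize
F_{\mathrm{phys}} \;=\; U - T S, \qquad T = \tfrac{1}{\beta}
% Identifying the energy with the negative expected reward, U = -\mathbb{E}_{\pi}[r],
% and the entropy S = \mathcal{H}[\pi], gives F_{\mathrm{phys}} = -F_{\mathrm{RL}}.
```

Maximizing the RL free energy is thus the same computation as minimizing the physicists' free energy of the corresponding system.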
8. Because actions in the present formulation are constrained by the self-financing condition, an independent Gaussian integration may produce inaccurate results. For a constrained version of the integral, with a constraint on the sum of the variables, see Exercise 10.6. In the next section, we present a case where unconstrained Gaussian integration works better.
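A standard way to handle a Gaussian integral with a constraint on the sum of the variables (a generic sketch, not the specific computation of Exercise 10.6) is to enforce the constraint with a Fourier representation of the delta function, which reduces the problem to unconstrained Gaussian integrals in one extra variable:

```latex
\int d^{n}u \;\delta\!\Big(\sum_{i} u_i - c\Big)\,
  e^{-\frac{1}{2} u^{\top} A u + b^{\top} u}
\;=\;
\int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, e^{-i\omega c}
  \int d^{n}u\; e^{-\frac{1}{2} u^{\top} A u + (b + i\omega \mathbf{1})^{\top} u}
```

Both the inner integral over $u$ and the remaining integral over $\omega$ are Gaussian, so the constrained integral is available in closed form in terms of $A^{-1}$, $b$, $\mathbf{1}$, and $c$.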
9. Or, if we want to place additional constraints on the resulting cash flows, to an optimization with one constraint, instead of two constraints as in the Merton approach.
References
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654.
Boyd, S., Busseti, E., Diamond, S., Kahn, R., Koh, K., Nystrup, P., et al. (2017). Multi-period trading via convex optimization. Foundations and Trends in Optimization, 3(1), 1–76.
Browne, S. (1996). Reaching goals by a deadline: digital options and continuous-time active portfolio management. https://www0.gsb.columbia.edu/mygsb/faculty/research/pubfiles/841/sidbrowne_deadlines.pdf.
Carr, P., Ellis, K., & Gupta, V. (1998). Static hedging of exotic options. Journal of Finance, 53(3), 1165–1190.
Černý, A., & Kallsen, J. (2007). Hedging by sequential regression revisited. Working paper, City University London and TU München.
Cheung, K. C., & Yang, H. (2007). Optimal investment-consumption strategy in a discrete-time model with regime switching. Discrete and Continuous Dynamical Systems, 8(2), 315–332.
Das, S. R., Ostrov, D., Radhakrishnan, A., & Srivastav, D. (2018). Dynamic portfolio allocation in goals-based wealth management. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3211951.
Duan, J. C., & Simonato, J. G. (2001). American option pricing under GARCH by a Markov chain approximation. Journal of Economic Dynamics and Control, 25, 1689–1718.
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.
Föllmer, H., & Schweizer, M. (1989). Hedging by sequential regression: An introduction to the mathematics of option trading. ASTIN Bulletin, 18, 147–160.
Fox, R., Pakman, A., & Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. In 32nd Conference on Uncertainty in Artificial Intelligence (UAI). https://arxiv.org/pdf/1512.08562.pdf.
Garleanu, N., & Pedersen, L. H. (2013). Dynamic trading with predictable returns and transaction costs. Journal of Finance, 68(6), 2309–2340.
Gosavi, A. (2015). Finite horizon Markov control with one-step variance penalties. In Proceedings of the Allerton Conference, Allerton, IL.
Grau, A. J. (2007). Applications of least-square regressions to pricing and hedging of financial derivatives. PhD thesis, Technische Universität München.
Halperin, I. (2018). QLBS: Q-learner in the Black-Scholes(-Merton) worlds. Journal of Derivatives 2020, (to be published). Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3087076.
Halperin, I. (2019). The QLBS Q-learner goes NuQLear: Fitted Q iteration, inverse RL, and option portfolios. Quantitative Finance, 19(9). https://doi.org/10.1080/14697688.2019.1622302, available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102707.
Halperin, I., & Feldshteyn, I. (2018). Market self-learning of signals, impact and optimal trading: invisible hand inference with free energy, (or, how we learned to stop worrying and love bounded rationality). https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3174498.
Lin, C., Zeng, L., & Wu, H. (2019). Multi-period portfolio optimization in a defined contribution pension plan during the decumulation phase. Journal of Industrial and Management Optimization, 15(1), 401–427. https://doi.org/10.3934/jimo.2018059.
Longstaff, F. A., & Schwartz, E. S. (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1), 113–147.
Markowitz, H. (1959). Portfolio selection: efficient diversification of investments. John Wiley.
Marschinski, R., Rossi, P., Tavoni, M., & Cocco, F. (2007). Portfolio selection with probabilistic utility. Annals of Operations Research, 151(1), 223–239.
Merton, R. C. (1971). Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3(4), 373–413.
Merton, R. C. (1973). Theory of rational option pricing. Bell Journal of Economics and Management Science, 4(1), 141–183.
Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073–1097.
Ortega, P. A., & Lee, D. D. (2014). An adversarial interpretation of information-theoretic bounded rationality. In Proceedings of the Twenty-Eighth AAAI Conference on AI. https://arxiv.org/abs/1404.5668.
Petrelli, A., Balachandran, R., Siu, O., Chatterjee, R., Jun, Z., & Kapoor, V. (2010). Optimal dynamic hedging of equity options: residual-risks transaction-costs. Working paper.
Potters, M., Bouchaud, J., & Sestovic, D. (2001). Hedged Monte Carlo: low variance derivative pricing with objective probabilities. Physica A, 289, 517–525.
Sato, Y. (2019). Model-free reinforcement learning for financial portfolios: a brief survey. https://arxiv.org/pdf/1904.04973.pdf.
Schweizer, M. (1995). Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20, 1–32.
Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceeding of the American Control Conference, Portland OR, USA, pp. 300–306.
van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/3964-double-q-learning.pdf.
Watkins, C. J. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge, England.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Wilmott, P. (1998). Derivatives: the theory and practice of financial engineering. Wiley.
Appendix
Answers to Multiple Choice Questions
Question 1
Answer: 2, 3.
Question 2
Answer: 2, 4.
Question 3
Answer: 1, 2, 3.
Question 4
Answer: 2, 4.
Python Notebooks
This chapter is accompanied by two notebooks that implement the QLBS model for option pricing and optimal hedging, and G-learning for wealth management. Further details of the notebooks are included in the README.md file.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Dixon, M.F., Halperin, I., Bilokon, P. (2020). Applications of Reinforcement Learning. In: Machine Learning in Finance. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1_10
DOI: https://doi.org/10.1007/978-3-030-41068-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41067-4
Online ISBN: 978-3-030-41068-1
eBook Packages: Mathematics and Statistics (R0)