Applications of Reinforcement Learning

Machine Learning in Finance

Abstract

This chapter considers real-world applications of reinforcement learning in finance, as well as further advances in the theory presented in the previous chapter. We start with one of the most common problems in quantitative finance: optimal portfolio trading in discrete time. Many practical problems of trading or risk management amount to different forms of dynamic portfolio optimization, with different optimization criteria, portfolio composition, and constraints. This chapter introduces a reinforcement learning approach to option pricing that generalizes the classical Black–Scholes model to a data-driven approach using Q-learning. It then presents a probabilistic extension of Q-learning called G-learning and shows how it can be used for dynamic portfolio optimization. For certain specifications of reward functions, G-learning is semi-analytically tractable and amounts to a probabilistic version of linear quadratic regulators (LQR). Such cases are analyzed in detail, and their solutions are illustrated with examples from dynamic portfolio optimization and wealth management.

Notes

  1.

    Here we use the notion of model-free learning in the same context as it is normally used in the machine learning literature, namely as a method that does not rely on an explicit model of feature dynamics. Option prices and hedge ratios in the framework presented in this section depend on a model of rewards, and in this sense are model-dependent.

  2.

    When transaction costs are neglected, taking u_T = 0 simply means converting all stock into cash. For more details on the choice u_T = 0, see Grau (2007).

  3.

    If it is desired to have non-negative option prices for arbitrary levels of risk-aversion, the method developed below can be generalized by using non-quadratic utility functions instead of the quadratic Markowitz utility. This would incur a moderate computational overhead of numerically solving a convex optimization problem at each time step, instead of a quadratic optimization that is solved semi-analytically.

  4.

    The standard continuous-time BSM model is equivalent to using a risk-neutral pricing measure for option valuation. This approach only enables pure risk-based option hedging, which might be suitable for a hedger but not for an option speculator.

  5.

    Note that the value function, as defined in Eq. (10.32), is not equal to a discounted sum of future rewards.

  6.

    We assume no short sale positions in our setting, and therefore do not include borrowing costs.

  7.

    Note that in physics, free energy is defined with a negative sign relative to Eq. (10.107). This difference is purely a matter of a sign convention, as maximization of Eq. (10.107) can be re-stated as minimization of its negative. Using our sign convention for the free energy function, we follow the reinforcement learning and information theory literature.

  8.

    Because actions in the present formulation are constrained by the self-financing condition, an independent Gaussian integration may produce inaccurate results. For a constrained version of the integral with a constraint on the sum of variables, see Exercise 10.6. In the next section we will present a case where an unconstrained Gaussian integration works better.

  9.

    Or, if we want to put additional constraints on the resulting cash-flows, to optimization with one constraint, instead of two constraints as in the Merton approach.
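
The sign convention discussed in Note 7 can be illustrated with a minimal Python sketch. It is not the authors' implementation. Eq. (10.107) is not reproduced on this page, so the sketch assumes the free energy has the form used by Fox et al. (2015), F = (1/beta) * log sum_a pi0(a) * exp(beta * G(a)), for a single state. The function name `free_energy`, the action values `g`, and the reference policy `prior` are hypothetical.

```python
import numpy as np


def free_energy(g_values, prior, beta):
    """Assumed form: F = (1/beta) * log sum_a prior(a) * exp(beta * G(a))."""
    g = np.asarray(g_values, dtype=float)
    p = np.asarray(prior, dtype=float)
    z = beta * g
    z_max = z.max()
    # Subtract the maximum before exponentiating for numerical stability.
    return (z_max + np.log(np.sum(p * np.exp(z - z_max)))) / beta


# Hypothetical action values and reference policy for one state.
g = np.array([1.0, 2.0, 0.5])
prior = np.array([0.2, 0.5, 0.3])

f_soft = free_energy(g, prior, beta=1.0)
f_greedy = free_energy(g, prior, beta=1e3)   # large beta: close to max(g)
f_prior = free_energy(g, prior, beta=1e-6)   # small beta: close to the prior mean of g

# Physics convention: the same quantity with the opposite sign,
# so maximizing f_soft is equivalent to minimizing physics_sign.
physics_sign = -f_soft
```

Under this assumed form, a large inverse temperature beta recovers the hard maximum over actions used in Q-learning, while a small beta recovers the average of the action values under the reference policy.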

References

  • Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654.

  • Boyd, S., Busetti, E., Diamond, S., Kahn, R., Koh, K., Nystrup, P., et al. (2017). Multi-period trading via convex optimization. Foundations and Trends in Optimization, 1–74.

  • Browne, S. (1996). Reaching goals by a deadline: digital options and continuous-time active portfolio management. https://www0.gsb.columbia.edu/mygsb/faculty/research/pubfiles/841/sidbrowne_deadlines.pdf.

  • Carr, P., Ellis, K., & Gupta, V. (1998). Static hedging of exotic options. Journal of Finance, 53(3), 1165–1190.

  • Černý, A., & Kallsen, J. (2007). Hedging by sequential regression revisited. Working paper, City University London and TU München.

  • Cheung, K. C., & Yang, H. (2007). Optimal investment-consumption strategy in a discrete-time model with regime switching. Discrete and Continuous Dynamical Systems, 8(2), 315–332.

  • Das, S. R., Ostrov, D., Radhakrishnan, A., & Srivastav, D. (2018). Dynamic portfolio allocation in goals-based wealth management. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3211951.

  • Duan, J. C., & Simonato, J. G. (2001). American option pricing under GARCH by a Markov chain approximation. Journal of Economic Dynamics and Control, 25, 1689–1718.

  • Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.

  • Föllmer, H., & Schweizer, M. (1989). Hedging by sequential regression: An introduction to the mathematics of option trading. ASTIN Bulletin, 18, 147–160.

  • Fox, R., Pakman, A., & Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. In 32nd Conference on Uncertainty in Artificial Intelligence (UAI). https://arxiv.org/pdf/1512.08562.pdf.

  • Garleanu, N., & Pedersen, L. H. (2013). Dynamic trading with predictable returns and transaction costs. Journal of Finance, 68(6), 2309–2340.

  • Gosavi, A. (2015). Finite horizon Markov control with one-step variance penalties. In Conference Proceedings of the Allerton Conferences, Allerton, IL.

  • Grau, A. J. (2007). Applications of least-square regressions to pricing and hedging of financial derivatives. PhD thesis, Technische Universität München.

  • Halperin, I. (2018). QLBS: Q-learner in the Black-Scholes(-Merton) worlds. Journal of Derivatives 2020, (to be published). Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3087076.

  • Halperin, I. (2019). The QLBS Q-learner goes NuQLear: Fitted Q iteration, inverse RL, and option portfolios. Quantitative Finance, 19(9). https://doi.org/10.1080/14697688.2019.1622302, available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102707.

  • Halperin, I., & Feldshteyn, I. (2018). Market self-learning of signals, impact and optimal trading: invisible hand inference with free energy, (or, how we learned to stop worrying and love bounded rationality). https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3174498.

  • Lin, C., Zeng, L., & Wu, H. (2019). Multi-period portfolio optimization in a defined contribution pension plan during the decumulation phase. Journal of Industrial and Management Optimization, 15(1), 401–427. https://doi.org/10.3934/jimo.2018059.

  • Longstaff, F. A., & Schwartz, E. S. (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1), 113–147.

  • Markowitz, H. (1959). Portfolio selection: efficient diversification of investments. John Wiley.

  • Marschinski, R., Rossi, P., Tavoni, M., & Cocco, F. (2007). Portfolio selection with probabilistic utility. Annals of Operations Research, 151(1), 223–239.

  • Merton, R. C. (1971). Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3(4), 373–413.

  • Merton, R. C. (1973). Theory of rational option pricing. Bell Journal of Economics and Management Science, 4(1), 141–183.

  • Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073–1097.

  • Ortega, P. A., & Lee, D. D. (2014). An adversarial interpretation of information-theoretic bounded rationality. In Proceedings of the Twenty-Eighth AAAI Conference on AI. https://arxiv.org/abs/1404.5668.

  • Petrelli, A., Balachandran, R., Siu, O., Chatterjee, R., Jun, Z., & Kapoor, V. (2010). Optimal dynamic hedging of equity options: residual-risks transaction-costs. Working paper.

  • Potters, M., Bouchaud, J., & Sestovic, D. (2001). Hedged Monte Carlo: low variance derivative pricing with objective probabilities. Physica A, 289, 517–525.

  • Sato, Y. (2019). Model-free reinforcement learning for financial portfolios: a brief survey. https://arxiv.org/pdf/1904.04973.pdf.

  • Schweizer, M. (1995). Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20, 1–32.

  • Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the American Control Conference, Portland, OR, USA, pp. 300–306.

  • van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/3964-double-q-learning.pdf.

  • Watkins, C. J. (1989). Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge, England.

  • Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.

  • Wilmott, P. (1998). Derivatives: the theory and practice of financial engineering. Wiley.

Appendix

Answers to Multiple Choice Questions

Question 1

Answer: 2, 3.

Question 2

Answer: 2, 4.

Question 3

Answer: 1, 2, 3.

Question 4

Answer: 2, 4.

Python Notebooks

This chapter is accompanied by two notebooks which implement the QLBS model for option pricing and optimal hedging, and G-learning for wealth management. Further details of the notebooks are included in the README.md file.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Dixon, M.F., Halperin, I., Bilokon, P. (2020). Applications of Reinforcement Learning. In: Machine Learning in Finance. Springer, Cham. https://doi.org/10.1007/978-3-030-41068-1_10
