Q-Learning: Flexible Learning About Useful Utilities

Moodie, Erica E. M.; Dean, Nema; Sun, Yue Ru

doi:10.1007/s12561-013-9103-z

Published: 12 September 2013

Volume 6, pages 223–243, (2014)

Statistics in Biosciences

Erica E. M. Moodie¹,
Nema Dean² &
Yue Ru Sun³

Abstract

Dynamic treatment regimes are fast becoming an important part of medicine, with the corresponding change in emphasis from treatment of the disease to treatment of the individual patient. Because few trials evaluate personally tailored treatment sequences, inferring optimal treatment regimes from observational data has become increasingly important. Q-learning is a popular method for estimating the optimal treatment regime, originally in randomized trials but more recently also in observational data. Previous applications of Q-learning have largely been restricted to continuous utility end-points with linear relationships. This paper is the first attempt at both extending the framework to discrete utilities and extending the modelling of covariates from linear forms to more flexible forms using the generalized additive model (GAM) framework. Simulated data results show that GAM-adapted Q-learning typically outperforms Q-learning with linear models and other frequently used methods based on propensity scores in terms of coverage and bias/MSE. This represents a promising step toward a more fully general Q-learning approach to estimating optimal dynamic treatment regimes.

References

Chakraborty B (2011) Dynamic treatment regimes for managing chronic health conditions: A statistical perspective. Am J Publ Health 101(1):40–45
Chakraborty B, Laber EB, Zhao Y (2013) Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme (submitted)
Chakraborty B, Moodie EEM (2013) Estimating optimal dynamic treatment regimes with shared decision rules across stages: An extension of Q-learning (submitted)
Chakraborty B, Murphy SA, Strecher V (2010) Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res 19(3):317–343
Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, Kupfer DJ (2003) Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatr Clin North Am 26(2):457–494
Golub G, Heath M, Wahba G (1979) Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21:215–224
Hastie T, Tibshirani R (1986) Generalized additive models. Stat Sci 1(3):297–318
Hastie T, Tibshirani R (1990) Generalized additive models. Chapman & Hall, London
Huang X, Ning J (2012) Analysis of multi-stage treatments for recurrent diseases. Stat Med 31:2805–2821
Li KC (1987) Asymptotic optimality of $C_p$, $C_L$, cross-validation and generalized cross-validation: Discrete index set. Ann Stat 15:958–975
Moodie EEM, Chakraborty B, Kramer MS (2012) Q-learning for estimating optimal dynamic treatment rules from observational data. Can J Stat 40:629–645
Moodie EEM, Richardson TS (2010) Estimating optimal dynamic regimes: Correcting bias under the null. Scand J Stat 37:126–146
Murphy SA, Oslin DW, Rush AJ, Zhu J (2007) Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology 32:257–262
Murphy SA (2005) A generalization error for Q-learning. J Mach Learn Res 6:1073–1097
Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, Waxmonsky JG, Yu J, Murphy SA (2012) Q-Learning: A data analysis method for constructing adaptive interventions. Psychol Methods 17:478–494
R Core Team (2012) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-07-0
Robins JM, Hernán MA, Brumback B (2000) Marginal structural models and causal inference in epidemiology. Epidemiology 11:550–560
Robins JM (2004) Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P (eds) Proceedings of the second Seattle symposium on biostatistics. Springer, New York, pp 189–326
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55
Rosthoj S, Fullwood C, Henderson R, Stewart S (2006) Estimation of optimal dynamic anticoagulation regimes from observational data: A regret-based approach. Stat Med 25:4197–4215
Schneider LS, Tariot PN, Lyketsos CG, Dagerman KS, Davis KL, Davis S (2001) National institute of mental health clinical antipsychotic trials of intervention effectiveness (CATIE): Alzheimer disease trial methodology. Am J Geriatr Psychiatry 9:346–360
Shortreed SM, Moodie EEM (2012) Estimating the optimal dynamic antipsychotic treatment regime: Evidence from the sequential-multiple assignment randomized CATIE schizophrenia study. J R Stat Soc, Ser B, Stat Methodol 61:577–599
Song R, Wang W, Zeng D, Kosorok MR (2013) Penalized Q-learning for dynamic treatment regimes (submitted)
Sutton RS, Barto AG (1998) Reinforcement learning: An introduction. MIT Press, Cambridge
Thall PF, Millikan RE, Sung HG (2000) Evaluating multiple treatment courses in clinical trials. Stat Med 30:1011–1128
Thall PF, Sung HG, Estey EH (2002) Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. J Am Stat Assoc 97(457):29–39
Topol E (2012) Creative destruction of medicine: How the digital revolution and personalized medicine will create better health care. Basic Books, New York
Wood SN (2004) Stable and efficient multiple smoothing parameter estimation for generalized additive models. J Am Stat Assoc 99(467):673–686
Wood SN (2006) Generalized additive models: An introduction with R. Chapman & Hall, London
Wood SN (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc B 73(1):3–36
Xin J, Chakraborty B, Laber EB (2012) qLearn: Estimation and inference for Q-learning. R package version 1.0
Zhao Y, Kosorok MR, Zeng D (2009) Reinforcement learning design for cancer clinical trials. Stat Med 28:3294–3315
Zhao Y, Zeng D, Socinski MA, Kosorok MR (2011) Reinforcement learning strategies for clinical trials in non-small cell lung cancer. Biometrics 67(4):1422–1433

Acknowledgements

We would like to thank Dr. Bibhas Chakraborty for insightful discussions. This work is supported by Dr. Moodie’s Discovery Grant from Canada’s Natural Sciences and Engineering Research Council (NSERC).

Author information

Authors and Affiliations

Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Quebec, Canada
Erica E. M. Moodie
School of Mathematics and Statistics, University of Glasgow, Glasgow, Scotland, UK
Nema Dean
Department of Mathematics and Statistics, School of Computer Science, McGill University, Montreal, Quebec, Canada
Yue Ru Sun

Corresponding author

Correspondence to Erica E. M. Moodie.

Appendix: Derivation of the True Dynamic Regime Parameters

In the following, we calculate the true values of the first-interval decision rule parameters $\psi_{10}$ and $\psi_{11}$ in terms of the $\gamma$'s and $\delta$'s, the parameters of the generative model, following the calculations in [2].

A.1 Bernoulli Utility

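We begin with the derivations for the case of a Bernoulli utility. Let $M=\gamma_0+\gamma_1C_1+\gamma_2O_1+\gamma_3A_1+\gamma_4O_1A_1+\gamma_5C_2+f_1(C_1)+f_2(C_2)$, and define $\mu=M+\gamma_6A_2+\gamma_7O_2A_2+\gamma_8A_1A_2$. Because expit is increasing and $a_2$ takes values in $\{-1,1\}$, maximizing over $a_2$ replaces the terms in $A_2$ by the absolute value of their combined coefficient. It follows that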

$$\begin{aligned} &\max_{a_2} Q_2(H_2, a_2) \\ &\quad{}= \operatorname{expit} \bigl(M + |\gamma_6 + \gamma_7O_2 + \gamma_8A_1| \bigr) \\ &\quad{}= \operatorname{expit} \biggl(M + \frac{1}{4}(1+O_2) (1+A_1)|f_1| + \frac{1}{4}(1+O_2) (1-A_1)|f_2| \\ &\qquad{} + \frac{1}{4}(1-O_2) (1+A_1)|f_3| + \frac{1}{4}(1-O_2) (1-A_1)|f_4| \biggr) , \end{aligned}$$

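where the constants $f_1=\gamma_6+\gamma_7+\gamma_8$, $f_2=\gamma_6+\gamma_7-\gamma_8$, $f_3=\gamma_6-\gamma_7+\gamma_8$, and $f_4=\gamma_6-\gamma_7-\gamma_8$ are distinct from the functions $f_1(\cdot)$ and $f_2(\cdot)$ appearing in $M$. Further,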

$$\begin{aligned} E(O_2\mid O_1,A_1) =& \frac{\exp(\delta_1O_1 +\delta_2A_1) -1}{\exp(\delta_1O_1 + \delta_2A_1)+1}, \\ 1 + E(O_2\mid O_1,A_1) =& 2 \operatorname{expit}(\delta_1O_1 + \delta_2A_1), \\ 1 - E(O_2\mid O_1,A_1) =& 2 \bigl(1- \operatorname{expit}(\delta_1O_1 + \delta_2A_1) \bigr). \end{aligned}$$
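
The two sets of identities above can be checked numerically. The following is a minimal sketch, assuming $O_2, A_1 \in \{-1, 1\}$; the parameter values and the function names (`abs_via_indicators`, `mean_o2`, `check_identities`) are hypothetical and chosen only for illustration.

```python
import itertools
import math


def expit(x):
    """Inverse logit, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def abs_via_indicators(o2, a1, g6, g7, g8):
    """Indicator-style expansion of |g6 + g7*O2 + g8*A1| for O2, A1 in {-1, 1}."""
    f1, f2 = g6 + g7 + g8, g6 + g7 - g8
    f3, f4 = g6 - g7 + g8, g6 - g7 - g8
    return 0.25 * (
        (1 + o2) * (1 + a1) * abs(f1)
        + (1 + o2) * (1 - a1) * abs(f2)
        + (1 - o2) * (1 + a1) * abs(f3)
        + (1 - o2) * (1 - a1) * abs(f4)
    )


def mean_o2(o1, a1, d1, d2):
    """E(O2 | O1, A1) as given in the text, for O2 coded as -1 / 1."""
    x = d1 * o1 + d2 * a1
    return (math.exp(x) - 1.0) / (math.exp(x) + 1.0)


# Hypothetical parameter values (assumptions, not taken from the paper).
G6, G7, G8 = 0.3, -1.1, 0.7
D1, D2 = 0.5, -0.8


def check_identities():
    """Verify both identities on all four (+/-1, +/-1) combinations."""
    for o, a in itertools.product((-1, 1), repeat=2):
        direct = abs(G6 + G7 * o + G8 * a)
        assert math.isclose(direct, abs_via_indicators(o, a, G6, G7, G8))
        e = mean_o2(o, a, D1, D2)
        p = expit(D1 * o + D2 * a)
        assert math.isclose(1 + e, 2 * p)
        assert math.isclose(1 - e, 2 * (1 - p))
    return True


check_identities()
```

Each of the four indicator products equals 4 for exactly one sign combination and 0 otherwise, which is why the expansion picks out a single $|f_j|$.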

Therefore,

$$\begin{aligned} Q_1(H_1, A_1) =& E \Bigl[\max _{a_2} Q_2(H_2, a_2) \mid H_1, A_1 \Bigr] \\ =& \operatorname{expit} \biggl(M + \frac{1}{2}(1+A_1)|f_3| + \frac{1}{2}(1-A_1)|f_4| \\ &{} + \frac{1}{2}\operatorname{expit}(\delta_1O_1 + \delta_2A_1) (1+A_1) \bigl( |f_1| - |f_3| \bigr) \\ &{} + \frac{1}{2}\operatorname{expit}(\delta_1O_1 + \delta_2A_1) (1-A_1) \bigl( |f_2| - |f_4| \bigr) \biggr). \end{aligned}$$

Furthermore,

$$\begin{aligned} 4\operatorname{expit}(\delta_1O_1 + \delta_2A_1) =& (1+O_1) (1+A_1) \operatorname{expit}(\delta_1+\delta_2) \\ &{} + (1+O_1) (1-A_1)\operatorname{expit}( \delta_1-\delta_2) \\ &{} + (1-O_1) (1+A_1)\operatorname{expit}(- \delta_1+\delta_2) \\ &{} + (1-O_1) (1-A_1)\operatorname{expit}(- \delta_1-\delta_2). \end{aligned}$$

Since $A_1\in\{-1,1\}$, we have $(1-A_{1}^{2})=0$, $(1+A_1)^2=2(1+A_1)$ and $(1-A_1)^2=2(1-A_1)$. It may therefore be deduced that

$$\begin{aligned} 2\operatorname{expit}(\delta_1O_1 + \delta_2A_1) (1+A_1) =& (1+O_1) (1+A_1)\operatorname{expit}(\delta_1+ \delta_2) \\ &{} + (1-O_1) (1+A_1)\operatorname{expit}(- \delta_1+\delta_2); \end{aligned}$$

and

$$\begin{aligned} 2\operatorname{expit}(\delta_1O_1 + \delta_2A_1) (1-A_1) =& (1+O_1) (1-A_1)\operatorname{expit}(\delta_1- \delta_2) \\ &{} + (1-O_1) (1-A_1)\operatorname{expit}(- \delta_1-\delta_2). \end{aligned}$$
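
As a sanity check, the expansion of $4\operatorname{expit}(\delta_1O_1+\delta_2A_1)$ and the two reduced forms can be evaluated on all four values of $(O_1, A_1)$. This is a sketch under the assumption that $O_1, A_1 \in \{-1,1\}$; the helper names and the values of $\delta_1, \delta_2$ used below are hypothetical.

```python
import itertools
import math


def expit(x):
    """Inverse logit, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def four_expit_expansion(o1, a1, d1, d2):
    """Right-hand side of the expansion of 4 * expit(d1*o1 + d2*a1)."""
    return (
        (1 + o1) * (1 + a1) * expit(d1 + d2)
        + (1 + o1) * (1 - a1) * expit(d1 - d2)
        + (1 - o1) * (1 + a1) * expit(-d1 + d2)
        + (1 - o1) * (1 - a1) * expit(-d1 - d2)
    )


def max_expansion_error(d1, d2):
    """Largest absolute discrepancy across the three identities."""
    worst = 0.0
    for o1, a1 in itertools.product((-1, 1), repeat=2):
        p = expit(d1 * o1 + d2 * a1)
        # Reduced form after multiplying by (1 + A1).
        plus = ((1 + o1) * (1 + a1) * expit(d1 + d2)
                + (1 - o1) * (1 + a1) * expit(-d1 + d2))
        # Reduced form after multiplying by (1 - A1).
        minus = ((1 + o1) * (1 - a1) * expit(d1 - d2)
                 + (1 - o1) * (1 - a1) * expit(-d1 - d2))
        worst = max(
            worst,
            abs(4 * p - four_expit_expansion(o1, a1, d1, d2)),
            abs(2 * p * (1 + a1) - plus),
            abs(2 * p * (1 - a1) - minus),
        )
    return worst


# Hypothetical delta values, for illustration only.
print(max_expansion_error(0.5, -0.8))
```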

Let $k_{1}=\frac{1}{4}\operatorname{expit}(\delta_{1}+\delta_{2})$, $k_{2}=\frac{1}{4}\operatorname{expit}(-\delta_{1}+\delta_{2})$, $k_{3}=\frac{1}{4}\operatorname{expit}(\delta_{1}-\delta_{2})$, $k_{4}=\frac{1}{4}\operatorname{expit}(-\delta_{1}-\delta_{2})$. Therefore,

$$\begin{aligned} &Q_1(H_1, A_1) \\ &\quad{}= \operatorname{expit} \biggl(M + \frac{1}{2}\bigl(|f_3| + |f_4|\bigr) + \frac{1}{2}\bigl(|f_3|-|f_4|\bigr)A_1 \\ &\qquad{} + (1+O_1) (1+A_1)k_1 \bigl( |f_1| - |f_3| \bigr) + (1-O_1) (1+A_1)k_2 \bigl( |f_1| - |f_3| \bigr) \\ &\qquad{} + (1+O_1) (1-A_1)k_3 \bigl( |f_2| - |f_4| \bigr) + (1-O_1) (1-A_1)k_4 \bigl( |f_2| - |f_4| \bigr) \biggr). \end{aligned}$$

Applying a logit transformation to $Q_1(H_1,A_1)$, collecting terms, and using $\operatorname{expit}(-x)=1-\operatorname{expit}(x)$ (so that $k_1+k_2+k_3+k_4=\frac{1}{2}$), the coefficient of $A_1$ in $\operatorname{logit}(Q_{1}(H_{1}, A_{1}))$ is

$$\psi_{10} = \gamma_3 + (k_1+k_2) |f_1| - (k_3+k_4) |f_2| + (k_3+k_4) |f_3| - (k_1+k_2) |f_4|, $$

and the coefficient of $O_1A_1$ in $\operatorname{logit}(Q_{1}(H_{1}, A_{1}))$ is

$$\psi_{11} = \gamma_4 + (k_1-k_2)|f_1| - (k_3-k_4)|f_2| - (k_1-k_2)|f_3| + (k_3-k_4)|f_4|. $$
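
The closed forms for $\psi_{10}$ and $\psi_{11}$ can be compared against coefficients recovered directly from the linear predictor. The sketch below assumes $O_1, A_1 \in \{-1,1\}$, so that any function of $(O_1, A_1)$ decomposes uniquely into terms in $1$, $O_1$, $A_1$ and $O_1A_1$, and the coefficient of $A_1$ (or $O_1A_1$) is the average of the function times $A_1$ (or $O_1A_1$) over the four cells. All parameter values, the `PARAMS` dictionary, and the function names are hypothetical; `rest` stands in for the terms of $M$ that involve neither $O_1$ nor $A_1$.

```python
import itertools
import math


def expit(x):
    """Inverse logit, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def abs_f(g6, g7, g8):
    """Return (|f1|, |f2|, |f3|, |f4|) as defined in the text."""
    return (abs(g6 + g7 + g8), abs(g6 + g7 - g8),
            abs(g6 - g7 + g8), abs(g6 - g7 - g8))


def psi_closed_form(p):
    """psi_10 and psi_11 from the closed-form expressions."""
    f1, f2, f3, f4 = abs_f(p["g6"], p["g7"], p["g8"])
    k1 = expit(p["d1"] + p["d2"]) / 4
    k2 = expit(-p["d1"] + p["d2"]) / 4
    k3 = expit(p["d1"] - p["d2"]) / 4
    k4 = expit(-p["d1"] - p["d2"]) / 4
    psi10 = p["g3"] + (k1 + k2) * f1 - (k3 + k4) * f2 + (k3 + k4) * f3 - (k1 + k2) * f4
    psi11 = p["g4"] + (k1 - k2) * f1 - (k3 - k4) * f2 - (k1 - k2) * f3 + (k3 - k4) * f4
    return psi10, psi11


def linear_predictor(o1, a1, p, rest=0.0):
    """logit Q1 before the k-substitution; `rest` holds terms free of O1 and A1."""
    f1, f2, f3, f4 = abs_f(p["g6"], p["g7"], p["g8"])
    pr = expit(p["d1"] * o1 + p["d2"] * a1)
    return (rest + p["g2"] * o1 + p["g3"] * a1 + p["g4"] * o1 * a1
            + 0.5 * (1 + a1) * f3 + 0.5 * (1 - a1) * f4
            + 0.5 * pr * (1 + a1) * (f1 - f3)
            + 0.5 * pr * (1 - a1) * (f2 - f4))


def psi_by_projection(p):
    """Recover the A1 and O1*A1 coefficients by averaging over the four cells."""
    cells = list(itertools.product((-1, 1), repeat=2))
    psi10 = sum(a * linear_predictor(o, a, p) for o, a in cells) / 4
    psi11 = sum(o * a * linear_predictor(o, a, p) for o, a in cells) / 4
    return psi10, psi11


# Hypothetical generative-model parameters (assumptions for illustration).
PARAMS = dict(g2=0.2, g3=-0.4, g4=0.6, g6=0.3, g7=-1.1, g8=0.7, d1=0.5, d2=-0.8)

closed = psi_closed_form(PARAMS)
projected = psi_by_projection(PARAMS)
assert all(math.isclose(c, q, abs_tol=1e-12) for c, q in zip(closed, projected))
```

Agreement between the two routes relies on $k_1+k_2+k_3+k_4=\frac{1}{2}$, which holds because $\operatorname{expit}(-x)=1-\operatorname{expit}(x)$.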

A.2 Poisson Utility

In this section, we derive the correspondence between the Q-function model parameters and the parameters from the data-generating models when the utility is given by a Poisson count. As above, let $M=\gamma_0+\gamma_1C_1+\gamma_2O_1+\gamma_3A_1+\gamma_4O_1A_1+\gamma_5C_2+f_1(C_1)+f_2(C_2)$, and $\mu=M+\gamma_6A_2+\gamma_7O_2A_2+\gamma_8A_1A_2$. Then

$$\begin{aligned} \max_{a_2} Q_2(H_2, a_2) =& \exp \biggl(M + \frac{1}{4}(1+O_2) (1+A_1)|f_1| + \frac{1}{4}(1+O_2) (1-A_1)|f_2| \\ &{} + \frac{1}{4}(1-O_2) (1+A_1)|f_3| + \frac{1}{4}(1-O_2) (1-A_1)|f_4| \biggr) , \end{aligned}$$

where $f_1$, $f_2$, $f_3$, and $f_4$ are as defined above. Thus,

$$\begin{aligned} &Q_1(H_1, A_1) \\ &\quad{}= \exp \biggl(M + \frac{1}{2}(1+A_1)|f_3| + \frac{1}{2}(1-A_1)|f_4| \\ &\qquad{} + \frac{1}{2}\operatorname{expit}(\delta_1O_1 + \delta_2A_1) (1+A_1) \bigl( |f_1| - |f_3| \bigr) \\ &\qquad{} + \frac{1}{2}\operatorname{expit}(\delta_1O_1 + \delta_2A_1) (1-A_1) \bigl( |f_2| - |f_4| \bigr) \biggr) \\ &\quad{}= \exp \biggl(M + \frac{1}{2}\bigl(|f_3| + |f_4|\bigr) + \frac{1}{2}\bigl(|f_3|-|f_4|\bigr)A_1 \\ &\qquad{} + (1+O_1) (1+A_1)k_1 \bigl( |f_1| - |f_3| \bigr) + (1-O_1) (1+A_1)k_2 \bigl( |f_1| - |f_3| \bigr) \\ &\qquad{} + (1+O_1) (1-A_1)k_3 \bigl( |f_2| - |f_4| \bigr) + (1-O_1) (1-A_1)k_4 \bigl( |f_2| - |f_4| \bigr) \biggr) \end{aligned}$$

where $k_1$, $k_2$, $k_3$, $k_4$ are as above. Therefore, the coefficients of $A_1$ and $O_1A_1$ in the exponent of the above expression, that is, in $\log Q_1(H_1,A_1)$, take the same form as in the case of a Bernoulli utility:

$$\begin{aligned} \psi_{10} =& \gamma_3 + (k_1+k_2) |f_1| - (k_3+k_4) |f_2| + (k_3+k_4) |f_3| - (k_1+k_2) |f_4|, \\ \psi_{11} =& \gamma_4 + (k_1-k_2)|f_1| - (k_3-k_4)|f_2| - (k_1-k_2)|f_3| + (k_3-k_4)|f_4|. \end{aligned}$$
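
One way to see what these parameters imply for the Poisson case is to look at the treatment contrast on the log scale: with $O_1, A_1 \in \{-1,1\}$, $\log Q_1(H_1, 1) - \log Q_1(H_1, -1) = 2(\psi_{10} + \psi_{11}O_1)$. The sketch below checks this numerically. The rule "choose $A_1 = 1$ when $\psi_{10}+\psi_{11}O_1 > 0$" is the usual Q-learning convention and is an assumption here, as are the parameter values, the `PARAMS` dictionary, and the function names; `rest` stands in for terms of $M$ free of $O_1$ and $A_1$.

```python
import math


def expit(x):
    """Inverse logit, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def abs_f(g6, g7, g8):
    """Return (|f1|, |f2|, |f3|, |f4|) as defined in the text."""
    return (abs(g6 + g7 + g8), abs(g6 + g7 - g8),
            abs(g6 - g7 + g8), abs(g6 - g7 - g8))


def psi_closed_form(p):
    """psi_10 and psi_11 from the closed-form expressions."""
    f1, f2, f3, f4 = abs_f(p["g6"], p["g7"], p["g8"])
    k1 = expit(p["d1"] + p["d2"]) / 4
    k2 = expit(-p["d1"] + p["d2"]) / 4
    k3 = expit(p["d1"] - p["d2"]) / 4
    k4 = expit(-p["d1"] - p["d2"]) / 4
    psi10 = p["g3"] + (k1 + k2) * f1 - (k3 + k4) * f2 + (k3 + k4) * f3 - (k1 + k2) * f4
    psi11 = p["g4"] + (k1 - k2) * f1 - (k3 - k4) * f2 - (k1 - k2) * f3 + (k3 - k4) * f4
    return psi10, psi11


def log_q1(o1, a1, p, rest=0.0):
    """log Q1 for the Poisson utility; `rest` holds terms free of O1 and A1."""
    f1, f2, f3, f4 = abs_f(p["g6"], p["g7"], p["g8"])
    pr = expit(p["d1"] * o1 + p["d2"] * a1)
    return (rest + p["g2"] * o1 + p["g3"] * a1 + p["g4"] * o1 * a1
            + 0.5 * (1 + a1) * f3 + 0.5 * (1 - a1) * f4
            + 0.5 * pr * (1 + a1) * (f1 - f3)
            + 0.5 * pr * (1 - a1) * (f2 - f4))


def first_stage_rule(o1, psi10, psi11):
    """Assumed decision rule: treat (+1) when psi10 + psi11*o1 is positive."""
    return 1 if psi10 + psi11 * o1 > 0 else -1


# Hypothetical generative-model parameters (assumptions for illustration).
PARAMS = dict(g2=0.2, g3=-0.4, g4=0.6, g6=0.3, g7=-1.1, g8=0.7, d1=0.5, d2=-0.8)

psi10, psi11 = psi_closed_form(PARAMS)
for o1 in (-1, 1):
    contrast = log_q1(o1, 1, PARAMS) - log_q1(o1, -1, PARAMS)
    assert math.isclose(contrast, 2 * (psi10 + psi11 * o1), abs_tol=1e-12)
    # exp is increasing, so maximizing Q1 and maximizing log Q1 agree.
    best = max((-1, 1), key=lambda a: math.exp(log_q1(o1, a, PARAMS)))
    assert best == first_stage_rule(o1, psi10, psi11)
```

Because the contrast depends only on $\psi_{10}$ and $\psi_{11}$, the same check applies on the logit scale for the Bernoulli utility.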

About this article

Cite this article

Moodie, E.E.M., Dean, N. & Sun, Y.R. Q-Learning: Flexible Learning About Useful Utilities. Stat Biosci 6, 223–243 (2014). https://doi.org/10.1007/s12561-013-9103-z

Received: 05 December 2012
Accepted: 22 August 2013
Published: 12 September 2013
Issue Date: November 2014
DOI: https://doi.org/10.1007/s12561-013-9103-z
