Incorporating causal factors into reinforcement learning for dynamic treatment regimes in HIV
Abstract
Background
Reinforcement learning (RL) provides a promising technique for solving complex sequential decision-making problems in health care domains. However, existing studies simply apply naive RL algorithms to discover optimal treatment strategies for a targeted problem. Such direct application ignores the abundant causal relationships between treatment options and the associated outcomes that are inherent in medical domains.
Methods
This paper investigates how to integrate causal factors into an RL process in order to improve the final learning performance and make the learned strategies easier to explain. A causal policy gradient algorithm is proposed and evaluated in dynamic treatment regimes (DTRs) for HIV based on a simulated computational model.
Results
Simulations demonstrate the effectiveness of the proposed algorithm for designing more efficient treatment protocols in HIV. Different definitions of the causal factors can have a significant influence on the final learning performance, which indicates that human prior knowledge is needed to define suitable causal relationships for a given problem.
Conclusions
More efficient and robust DTRs for HIV can be derived through incorporation of causal factors between options of anti-HIV drugs and the associated treatment outcomes.
Keywords
Reinforcement learning, Dynamic treatment regime, HIV, Causal factors
Background
Reinforcement learning (RL) [1] has achieved considerable success in solving complex sequential decision making problems in various health care domains, such as treatment in HIV [2], cancer [3], diabetes [4], schizophrenia [5], and sepsis [6]. In such typical RL implementations, a model designer normally formulates the learning components (the objective, state, action, reward, etc.), specifies the representation and efficiency techniques, and then simply lets the RL algorithm run until a satisfactory solution is obtained. Such fully automated, black-box learning processes ignore the rich knowledge encoded in causal relationships between variables such as the duration, dose or type of treatment and the corresponding therapeutic outcomes. As a result, the learned policies may not be interpretable enough to explain why some policies are helpful while others are not [7].
Discovering effective treatment strategies for HIV-infected individuals remains one of the most significant challenges in medical research. To date, the effective way to treat HIV is to use a combination of anti-HIV drugs (i.e., antiretrovirals) in the form of Highly Active Antiretroviral Therapy (HAART) to inhibit the development of drug-resistant HIV strains [8]. Patients suffering from HIV are typically prescribed a series of treatments over time in order to maximize long-term positive outcomes, such as reducing the treatment burden and improving adherence to medication. However, because individuals differ in their immune responses to treatment at a particular time, discovering the optimal drug combinations and scheduling strategy is a difficult task in both medical research and clinical trials.
In this paper, we propose a causal policy gradient (CPG) algorithm that is capable of incorporating causal factors into an RL process in order to improve the final learning performance and make the learned strategies easier to explain. We illustrate how CPG can be applied to solve DTR problems in HIV. Experiments show the effectiveness of CPG in designing more efficient and robust treatment protocols in HIV. The remainder of the paper is organized as follows. We first discuss some related work and introduce the main principle of the CPG algorithm. We then provide the details of the implementation of CPG in HIV treatment. Finally, we conclude the paper by pointing out some directions for future work.
Related work
RL has been applied to DTRs in HIV in several studies. Ernst et al. [9] first introduced RL techniques for computing Structured Treatment Interruption (STI) strategies for HIV-infected patients. Using a mathematical model [8] to artificially generate the clinical data, a batch RL method, i.e., fitted Q iteration (FQI) with extremely randomized trees, was applied to learn an optimal drug prescription strategy in an off-line manner. The derived STI strategy features cycling between the two main anti-HIV drugs: Reverse Transcriptase Inhibitors (RTI) and Protease Inhibitors (PI). Using the same mathematical model, Parbhoo [10] further applied three kinds of batch RL methods, FQI with extremely randomized trees, neural FQI and least-squares policy iteration (LSPI), to the problem of drug scheduling and HIV treatment design. Results indicated that each learning technique had its own advantages and disadvantages. Moreover, a test based on a ten-year period of real clinical data from 250 HIV-infected patients in Charlotte Maxeke Johannesburg Academic Hospital, South Africa, verified that the RL methods were capable of suggesting treatments that are reasonably compliant with those suggested by clinicians.
The authors in [11] used the Q-learning algorithm in HIV treatment and obtained good performance and high functionality in controlling the free virions for both certain and uncertain HIV models. A mixture-of-experts approach was proposed in [2] to combine the strengths of both kernel-based regression methods (i.e., history-alignment model) and RL (i.e., model-based Bayesian POMDP model) for HIV therapy selection. Using a subset of the EuResist database consisting of HIV genotype and treatment response data for 32,960 patients, together with the 312 most common drug combinations in the cohort, the treatment therapy derived by the mixture-of-experts approach outperformed those derived by using each method alone. Marivate et al. [12] formalized a routine to accommodate multiple sources of uncertainty in batch RL methods to better evaluate the effectiveness of treatments across subpopulations of HIV patients. Killian et al. [13] similarly attempted to address and identify the variations across subpopulations in the development of HIV treatment policies by transferring knowledge between task instances.
Unlike the above studies, which mainly focus on value-based RL for developing treatment policies in HIV, we are the first to evaluate policy gradient RL methods on such problems. Moreover, we model causal relationships between the options of anti-HIV drugs and the associated treatment effects, and introduce these causal factors into the policy gradient learning process in order to improve learning and make the learned policies easier to interpret.
Methods
In this section, we first provide a basic introduction to RL, and in particular to policy gradient RL, and then present the main procedure of the proposed causal policy gradient algorithm.
Policy gradient RL
RL enables an agent to learn effective strategies in sequential decision making problems through trial-and-error interactions with its environment [1]. An RL problem is usually formalized as a Markov decision process (MDP), which has a long history in research on decision making in stochastic settings. Formally, an MDP can be defined by a 5-tuple \(\mathcal {M} = \left (\mathcal {S}, \mathcal {A}, \mathcal {P}, \mathcal {R},\gamma \right)\), where \(\mathcal {S}\) is a finite state space, and \(s_{t}\in \mathcal {S}\) denotes the state of the agent at time t; \(\mathcal {A}\) is a set of actions available to the agent, and \(a_{t}\in \mathcal {A}\) denotes the action that the agent performs at time t; \(\mathcal {P}(s,a,s^{\prime }): \mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow [0,1]\) is a Markovian transition function giving the probability that the agent moves from state s to state \(s^{\prime }\) after taking action a; \(\mathcal {R}: \mathcal {S} \times \mathcal {A} \rightarrow \mathfrak {R}\) is a reward function that returns the immediate reward \(\mathcal {R}(s,a)\) to the agent after taking action a in state s; and \(\gamma \in [0,1]\) is a discount factor.
Since the Bellman operator \(\mathscr {B}^{\pi }\) is a contraction mapping on the value function V, there exists a fixed point \(V^{\pi }\) such that \(\mathscr {B}^{\pi }V^{\pi }=V^{\pi }\) in the limit. The goal of an MDP problem is to compute an optimal policy \(\pi ^{\ast }\) such that \(V^{\pi ^{\ast }}(s)\geq V^{\pi }(s)\) for every policy π and every state \(s\in \mathcal {S}\).
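The fixed-point property can be illustrated numerically. The sketch below assumes the standard form of the Bellman operator for a fixed policy, \((\mathscr {B}^{\pi }V)(s)=\mathcal {R}(s,\pi (s))+\gamma \sum _{s^{\prime }}\mathcal {P}(s,\pi (s),s^{\prime })V(s^{\prime })\), which is not spelled out in the text here, and a hypothetical two-state MDP whose numbers are chosen only for illustration.

```python
# Toy 2-state MDP under a fixed policy (all numbers are illustrative assumptions).
GAMMA = 0.9
# P_pi[s][s2]: probability of moving from s to s2 when following the policy.
P_pi = [[0.8, 0.2],
        [0.3, 0.7]]
# R_pi[s]: immediate reward for the action the policy takes in state s.
R_pi = [1.0, 0.0]


def bellman(V):
    """Apply the Bellman operator for the fixed policy once."""
    n = len(V)
    return [R_pi[s] + GAMMA * sum(P_pi[s][s2] * V[s2] for s2 in range(n))
            for s in range(n)]


def evaluate_policy(tol=1e-10, max_iter=10_000):
    """Iterate the operator from V = 0 until successive values differ by < tol."""
    V = [0.0] * len(R_pi)
    for _ in range(max_iter):
        V_next = bellman(V)
        if max(abs(a - b) for a, b in zip(V, V_next)) < tol:
            return V_next
        V = V_next
    return V


V_pi = evaluate_policy()
# Applying the operator to V_pi should leave it (numerically) unchanged.
residual = max(abs(a - b) for a, b in zip(V_pi, bellman(V_pi)))
```

Because the operator shrinks distances by a factor of γ at each application, the iteration converges from any starting point when γ<1, and `residual` ends up close to zero.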
where τ is the trajectory, θ is the parameter, and m is the number of trajectories.
In Eq. (3), \(\nabla _{\theta }U(\theta)\) is the gradient of the policy objective, \(\pi _{\theta }(\tau,\theta)\) is the probability that trajectory τ occurs, \(\nabla _{\theta }\log \pi _{\theta }(\tau,\theta)\) is the direction in which the log-probability of τ increases most steeply with θ, and R(τ) is the reward of a trajectory, which controls the direction and step size of the parameter update.
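For reference, the standard likelihood-ratio (REINFORCE-style) estimator that matches the symbols described above can be written as follows. This is a sketch under the assumption that Eq. (3) takes this usual sample-average form over m trajectories; the trajectory index i is introduced here for illustration.

```latex
\nabla_{\theta} U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m}
  \nabla_{\theta} \log \pi_{\theta}\left(\tau^{(i)}, \theta\right) R\left(\tau^{(i)}\right)
```

Trajectories with a high reward R(τ) push θ towards making them more likely, while trajectories with a low or negative reward push θ away.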
Incorporating causal factors into policy gradient RL
where P(A) represents the probability that event A occurs, \(P(A\cap B)\) represents the probability that events A and B occur at the same time, and \(P(\neg A\cap B)\) represents the probability that event B occurs while event A does not. The causal factor \(C_{(B|A)}\) can be computed using a sampling method proposed in [16].
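Written out, the causal factor used in Algorithm 1 can be sketched as below, with the second denominator read as 1−P(A); the rewriting in terms of conditional probabilities is added here as an aid to interpretation and assumes 0<P(A)<1.

```latex
C_{(B|A)} = \frac{P(A \cap B)}{P(A)} - \frac{P(\neg A \cap B)}{1 - P(A)}
          = P(B \mid A) - P(B \mid \neg A)
```

As a purely illustrative example, with P(A)=0.5, P(A∩B)=0.4 and P(¬A∩B)=0.1, the causal factor is 0.4/0.5 − 0.1/0.5 = 0.8 − 0.2 = 0.6. The value lies in [−1, 1] and is 0 when B is equally likely with and without A.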
where τ, θ, \(\nabla _{\theta }U(\theta)\), \(\pi _{\theta }(\tau,\theta)\), and \(\nabla _{\theta }\log \pi _{\theta }(\tau,\theta)\) are the same as in Eq. (3).
The Causal Policy Gradient (CPG) Algorithm
Algorithm 1: The CPG Algorithm |
---|
Function CPG |
Input: a differentiable policy parameterization π(a|s,θ), ∀a∈A, s∈S, θ∈R^{d}, C=0; |
Initialize policy parameter θ; |
Repeat forever: |
Define event A and event B; |
Generate an episode s_{0},a_{0},r_{1},...,s_{T−1},a_{T−1},r_{T}, following π(a|s,θ); |
For each step of the episode t=0,...,T-1: |
G ← average future return from step t; |
C=P(A∩B)/P(A)−P(^{¬}A∩B)/(1−P(A)); |
θ←θ+α▽_{θ}logπ(a_{t}|s_{t},θ)∗G∗C; |
End for |
Return θ; |
End CPG |
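
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the experiments: the three-level toy environment, the tabular softmax policy, the choice of event B, the reading of "average future return" as the mean of the remaining rewards, and the fallback C=0 when P(A) is 0 or 1 are all assumptions made for the example.

```python
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 3, 2, 20


def step(state, action, rng):
    """Hypothetical toy dynamics: the drug (action 1) tends to lower the level."""
    drift = -1 if action == 1 else 1
    if rng.random() < 0.8:
        state = min(max(state + drift, 0), N_STATES - 1)
    reward = -float(state) - 0.1 * action
    return state, reward


def policy_probs(theta, state):
    """Softmax policy over actions for a tabular parameter theta[state, action]."""
    e = np.exp(theta[state] - theta[state].max())
    return e / e.sum()


def causal_factor(a_flags, b_flags):
    """C = P(A and B)/P(A) - P(not A and B)/(1 - P(A)), from episode frequencies."""
    a = np.asarray(a_flags, dtype=bool)
    b = np.asarray(b_flags, dtype=bool)
    p_a = a.mean()
    if p_a == 0.0 or p_a == 1.0:
        return 0.0  # a ratio is undefined; fall back to the initial value C = 0
    return (a & b).mean() / p_a - (~a & b).mean() / (1.0 - p_a)


def cpg(episodes=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        # Generate one episode following the current policy.
        state, traj = N_STATES - 1, []
        for _ in range(HORIZON):
            action = int(rng.choice(N_ACTIONS, p=policy_probs(theta, state)))
            next_state, reward = step(state, action, rng)
            traj.append((state, action, reward, next_state))
            state = next_state
        rewards = np.array([r for _, _, r, _ in traj])
        # Event A: drug given; event B (assumed): next level is the lowest one.
        c = causal_factor([a == 1 for _, a, _, _ in traj],
                          [s2 == 0 for _, _, _, s2 in traj])
        for t, (s, a, _, _) in enumerate(traj):
            g = rewards[t:].mean()            # mean of the remaining rewards
            grad_log = -policy_probs(theta, s)
            grad_log[a] += 1.0                # gradient of log softmax w.r.t. theta[s]
            theta[s] += alpha * grad_log * g * c
    return theta


theta = cpg()
```

Compared with a plain policy gradient update, the only change is the extra factor `c`, which scales the step and flips its sign when the estimated causal factor is negative.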
Results
In this section, we evaluate CPG in the treatment of HIV to verify its effectiveness. We first briefly introduce the DTR problem in HIV and its RL formulations. We then use the direct PG algorithm to simulate HIV treatment, and investigate how the proposed CPG algorithm can be applied to solve this problem. Finally, we provide some discussions on the shortcomings of current research that need to be addressed in the future work.
MDP for DTRs in HIV
The simulated HIV treatment model [9] has a six-dimensional continuous state space, consisting of the concentrations of healthy CD4\(^{+}\) T-lymphocytes (\(T_{1}\)), healthy macrophages (\(T_{2}\)), infected CD4\(^{+}\) T-lymphocytes (\(T^{*}_{1}\)), infected macrophages (\(T^{*}_{2}\)), free virus particles (V) and HIV-specific cytotoxic T-cells (E). The full drug interaction model is given in the Appendix.
While anti-retroviral treatment regimens are sometimes augmented by other types of drugs that enhance the effect of anti-HIV treatment, bolster the immune system, or reduce side effects, our current effort focuses on representatives of two main classes of enzymes: reverse transcriptase inhibitor (RTI) and protease inhibitor (PI). RTI prevents HIV RNA from being converted into DNA, thus blocking integration of the viral code into the target cell. On the other hand, PI affects the viral assembly process in the final stage of the viral life cycle, preventing the proper cutting and structuring of the viral proteins before their release from the host cell. PI therefore effectively reduces the number of infectious virus particles released by an infected cell. In all, there are four treatment regimens: only RTI on, only PI on, RTI and PI on, RTI and PI off. The four medication regimens are treated as four discrete actions.
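The four regimens can be encoded as discrete actions that map to the two control variables. The sketch below uses the fixed "on" levels of 0.7 for RTI and 0.3 for PI mentioned in the Discussion; the action indices and the names `ACTIONS` and `controls` are hypothetical and chosen only for this example.

```python
# "On" levels for the two drugs, as stated in the Discussion (0.7 and 0.3).
RTI_ON, PI_ON = 0.7, 0.3

# Action index -> (description, (xi_1, xi_2)); the indexing is an assumption.
ACTIONS = {
    0: ("RTI and PI off", (0.0, 0.0)),
    1: ("only RTI on", (RTI_ON, 0.0)),
    2: ("only PI on", (0.0, PI_ON)),
    3: ("RTI and PI on", (RTI_ON, PI_ON)),
}


def controls(action):
    """Return the control pair (xi_1, xi_2) for a discrete action index."""
    return ACTIONS[action][1]
```

For instance, `controls(3)` gives the pair used when both drugs are administered, and `controls(0)` corresponds to a treatment interruption.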
Different equilibrium points of the six cells
Equilibrium point | T_{1} | T_{2} | T\(^{*}_{1}\) | T\(^{*}_{2}\) | V | E |
---|---|---|---|---|---|---|
The healthy, unstable state | 10^{6} | 3198 | 0 | 0 | 0 | 10 |
The healthy, locally stable state | 967839 | 621 | 76 | 6 | 415 | 353108 |
The non-healthy, locally stable state | 163573 | 5 | 11945 | 46 | 63919 | 24 |
RL for HIV treatment
Applying CPG for HIV treatment
Let the initial state, parameter values, observation period, number of patients, initial strategy and other relevant variables be the same as in the “RL for HIV treatment” section. We now apply CPG to the HIV model. To define the causal factor, let event A represent taking a drug action at a given step (i.e., adding enzyme RTI or PI), let event \(\neg A\) mean that no enzyme is added, and let event B mean the outcome V>415 (i.e., the number of free virus particles is greater than 415). Thus, P(A) is the probability of taking an enzymatic action at each step, \(P(A\cap B)\) is the probability that an enzyme is added and the number of free virus particles is greater than 415, and \(P(\neg A\cap B)\) is the probability that no enzyme is added and the number of free virus particles is greater than 415. For each course of treatment, we count the frequencies of these events and use them as estimates of the corresponding probabilities.
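The frequency counting described above can be sketched as follows. The record layout (one `(drug_given, virus_count)` pair per decision step), the function name, and the sample course are assumptions made for illustration, not data from the experiments.

```python
V_THRESHOLD = 415  # free-virus level that defines event B in the text


def causal_factor_from_course(course):
    """Estimate C from one course of treatment.

    `course` is a list of (drug_given, virus_count) pairs, one per decision
    step; this record layout is an assumption made for the example.
    """
    n = len(course)
    n_a = sum(1 for drug, _ in course if drug)
    n_a_and_b = sum(1 for drug, v in course if drug and v > V_THRESHOLD)
    n_nota_and_b = sum(1 for drug, v in course if not drug and v > V_THRESHOLD)
    if n_a == 0 or n_a == n:
        return None  # P(A) is 0 or 1, so one of the two ratios is undefined
    p_a = n_a / n
    return (n_a_and_b / n) / p_a - (n_nota_and_b / n) / (1 - p_a)


# Hypothetical five-step course: three steps with a drug, two without.
course = [(True, 63919), (True, 20000), (True, 300), (False, 500), (False, 200)]
c = causal_factor_from_course(course)
```

In this made-up course, B occurs in two of the three steps with a drug and in one of the two steps without, so the estimate is 2/3 − 1/2.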
Figure 5b shows how the causal factor (C) changes from the first episode to the 300th episode. During the early training stage, the causal factor is quite low and has little effect on the learning process. As learning proceeds, the causal factor increases and reaches a dynamic balance after around 50 episodes. Because of the constant random exploration in the learning process, the causal factor keeps fluctuating and finally settles close to 1. In the initial treatment phase, the patient’s V=63919 is much greater than 415. As learning proceeds, the medication policy improves and the regulating effect of the causal factor becomes stronger. After 50 episodes, the CPG algorithm has learned the strategy of continuous dosing of both enzymes, so the causal factor also reaches a state of dynamic equilibrium.
Definition of different causal factors
Event | C1 | C2 |
---|---|---|
A | adding enzyme RTI or PI | adding enzyme RTI or PI |
^{¬}A | without adding enzyme | without adding enzyme |
B | V>415 | T_{2}<621 |
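
Using the causal factor formula from Algorithm 1 (with the second denominator read as 1−P(A)), the two definitions in the table can be written in conditional-probability form as below. This is a restatement for clarity, assuming 0<P(A)<1.

```latex
C_{1} = P(V > 415 \mid A) - P(V > 415 \mid \neg A), \qquad
C_{2} = P(T_{2} < 621 \mid A) - P(T_{2} < 621 \mid \neg A)
```

The thresholds 415 and 621 are the values of V and \(T_{2}\) at the healthy, locally stable equilibrium point listed in the table of equilibrium points.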
Discussion
where ξ_{1}∈[0, 1) and ξ_{2}∈[0, 1) are the control variables representing RTIs and PIs, respectively. In order to get a better strategy, we set a_{1}=0.0, a_{2}=0.0, b_{1}=0.7 and b_{2}=0.3. We used partial differential equations to solve the dynamics parameters, and applied Eqs. (7) and (8) to obtain the optimal strategy. Figure 6b plots the computed optimal strategy, in which the red line and the blue line represent the dosing of PI and RTI, respectively. After 400 days of treatment, both drugs are stopped, indicating that the patient has reached a drug-free, healthy, stable state.
There are two main reasons why general RL algorithms such as the PG and CPG algorithms in this paper could not discover the optimal drug-free solution. On the one hand, the MDP model adopted in this paper considers only four discrete actions involving two types of enzyme and assigns predefined fixed values of 0.7 and 0.3 to the parameters ξ_{1} and ξ_{2}. This heavy simplification makes it difficult or impossible to fully explore the whole state space of the model in order to derive the optimal solution. Moreover, the reward function used in the MDP model is too abstract to reflect the complex dynamics of the treatments. On the other hand, the policy gradient algorithms themselves do not incorporate any sophisticated exploration strategy during the learning process. This is a critical problem, since HIV treatment has long been recognized as a testbed for evaluating advanced exploration algorithms in RL research [18, 19]. Previous studies have shown that the basin of attraction of the healthy steady state in HIV is relatively small compared to that of the non-healthy steady state. Thus, in the absence of drugs, perturbing the uninfected steady state by adding even a small amount of virus leads to asymptotic convergence towards the non-healthy steady state.
Conclusions
Simulation-based DTR design has a series of advantages over cytopathological treatment: it avoids harm to patients during drug exploration, provides a large amount of treatment experience for diseases with few real cases, reduces the cost of actual treatment, and shortens the duration of treatment. In this paper, we investigated the role of RL in DTRs for simulated patients with HIV. We showed that both the direct PG algorithm and its causal extension could obtain a better medication regimen after a period of learning, but that the CPG algorithm was more efficient and robust because it incorporates causal factors between options of anti-HIV drugs and the associated treatment outcomes. We also showed that different definitions of the causal factor can have a significant influence on the final learning performance, which indicates that human prior knowledge is needed to define suitable causal relationships for a given problem. Discovering the most beneficial or optimal causal factors from historical interaction trajectories is thus important for automating the whole learning process, and is left for future work.
Appendix
where \(T_{1}\), \(T_{2}\), \(T^{*}_{1}\), \(T^{*}_{2}\), V and E are the quantities of the six cell and virus populations; ξ_{1}∈[0, 1) and ξ_{2}∈[0, 1) are the control variables representing RTIs and PIs, respectively; W_{ij}>0 are the penalty multipliers; and η_{n} are the adjoint variables.
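Since the model equations themselves are not reproduced in this text, the following is a minimal sketch of the six-state dynamics, assuming the form of the model of Adams et al. [8] as it is usually stated. All parameter names and values in `PARAMS` are assumptions recalled from that reference and should be checked against it before use.

```python
# Parameter values as commonly quoted for the Adams et al. model [8];
# they are assumptions here and should be checked against that reference.
PARAMS = dict(
    lam1=1e4, d1=0.01, k1=8e-7, lam2=31.98, d2=0.01, f=0.34, k2=1e-4,
    delta=0.7, m1=1e-5, m2=1e-5, n_t=100.0, c=13.0, rho1=1.0, rho2=1.0,
    lam_e=1.0, b_e=0.3, k_b=100.0, d_e=0.25, k_d=500.0, delta_e=0.1,
)


def hiv_rhs(state, xi1, xi2, p=PARAMS):
    """Time derivatives of (T1, T2, T1*, T2*, V, E) under controls xi1 (RTI), xi2 (PI)."""
    t1, t2, t1s, t2s, v, e = state
    inf1 = (1 - xi1) * p["k1"] * v * t1            # new infections of T1 cells
    inf2 = (1 - p["f"] * xi1) * p["k2"] * v * t2   # new infections of T2 cells
    infected = t1s + t2s
    d_t1 = p["lam1"] - p["d1"] * t1 - inf1
    d_t2 = p["lam2"] - p["d2"] * t2 - inf2
    d_t1s = inf1 - p["delta"] * t1s - p["m1"] * e * t1s
    d_t2s = inf2 - p["delta"] * t2s - p["m2"] * e * t2s
    d_v = ((1 - xi2) * p["n_t"] * p["delta"] * infected - p["c"] * v
           - (p["rho1"] * inf1 + p["rho2"] * inf2))
    d_e = (p["lam_e"]
           + p["b_e"] * infected / (infected + p["k_b"]) * e
           - p["d_e"] * infected / (infected + p["k_d"]) * e
           - p["delta_e"] * e)
    return (d_t1, d_t2, d_t1s, d_t2s, d_v, d_e)


# The healthy, unstable equilibrium from the table of equilibrium points.
healthy_unstable = (1e6, 3198.0, 0.0, 0.0, 0.0, 10.0)
derivs = hiv_rhs(healthy_unstable, xi1=0.0, xi2=0.0)
```

With no virus and no infected cells, every derivative at the healthy, unstable point is zero up to rounding, which is consistent with the first row of the equilibrium table.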
Notes
Acknowledgements
Not applicable.
Funding
This work is supported by the Hongkong Scholar Program under Grant No. XJ2017028, and Dalian High Level Talent Innovation Support Program under Grant 2017RQ008.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the first author on reasonable request.
About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 2, 2019: Proceedings from the 4th China Health Information Processing Conference (CHIP 2018). The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-19-supplement-2.
Authors’ contributions
YC proposed the idea and drafted the manuscript. DY and RG contributed to the implementation, collection, analysis, and interpretation of experimental data. LJ supervised the research and proofread the manuscript. All authors contributed to the preparation, review, and approval of the final manuscript and the decision to submit the manuscript for publication. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge: The MIT Press; 1998.
- 2. Parbhoo S, Bogojeska J, Zazzi M, Roth V, Doshi-Velez F. Combining kernel and model based learning for HIV therapy selection. AMIA Summits Transl Sci Proc. 2017; 2017:239.
- 3. Tseng H-H, Luo Y, Cui S, Chien J-T, Ten Haken RK, Naqa IE. Deep reinforcement learning for automated radiation adaptation in lung cancer. Med Phys. 2017; 44(12):6690–705.
- 4. Daskalaki E, Diem P, Mougiakakou SG. Model-free machine learning in biomedicine: Feasibility study in type 1 diabetes. PloS ONE. 2016; 11(7):0158722.
- 5. Shortreed SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Mach Learn. 2011; 84(1-2):109–36.
- 6. Weng W-H, Gao M, He Z, Yan S, Szolovits P. Representation and reinforcement learning for personalized glycemic control in septic patients. 2017. arXiv preprint arXiv:1712.00654.
- 7. Hein D, Udluft S, Runkler TA. Interpretable policies for reinforcement learning by genetic programming. Eng Appl Artif Intell. 2018; 76:158–69.
- 8. Adams BM, Banks HT, Kwon H-D, Tran HT. Dynamic multidrug therapies for HIV: Optimal and STI control approaches. Math Biosci Eng. 2004; 1(2):223–41.
- 9. Ernst D, Stan G-B, Goncalves J, Wehenkel L. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In: 45th IEEE Conference on Decision and Control. New York: IEEE; 2006. p. 667–72.
- 10. Parbhoo S. A reinforcement learning design for HIV clinical trials. 2014. PhD thesis.
- 11. Gholizade-Narm H, Noori A. Control the population of free viruses in nonlinear uncertain HIV system using Q-learning. Int J Mach Learn Cybern. 2018; 9(7):1169–79.
- 12. Marivate VN, Chemali J, Brunskill E, Littman ML. Quantifying uncertainty in batch personalized sequential decision making. In: AAAI Workshop: Modern Artificial Intelligence for Health Analytics. Cambridge: The AAAI Press; 2014.
- 13. Killian T, Konidaris G, Doshi-Velez F. Transfer learning across patient variations with hidden parameter Markov decision processes. 2016. arXiv preprint arXiv:1612.00475.
- 14. Wiering M, Van Otterlo M. Reinforcement learning. vol 12. Adapt Learn Optim. Berlin: Springer; 2012.
- 15. Watkins CJ, Dayan P. Q-learning. Mach Learn. 1992; 8(3-4):279–92.
- 16. Merck CA, Kleinberg S. Causal explanation under indeterminism: A sampling approach. In: AAAI. Cambridge: The AAAI Press; 2016. p. 1037–43.
- 17. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992; 8(3-4):229–56.
- 18. Kawaguchi K. Bounded optimal exploration in MDP. In: AAAI. Cambridge: The AAAI Press; 2016. p. 1758–64.
- 19. Pazis J, Parr R. PAC optimal exploration in continuous space Markov decision processes. In: AAAI. Cambridge: The AAAI Press; 2013.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.