Projection based inverse reinforcement learning for the analysis of dynamic treatment regimes

Dynamic Treatment Regimes (DTRs) are adaptive treatment strategies that allow clinicians to personalize dynamically the treatment for each patient based on their step-by-step response to their treatment. There are a series of predefined alternative treatments for each disease and any patient may associate with one of these treatments according to his/her demographics. DTRs for a certain disease are studied and evaluated by means of statistical approaches where patients are randomized at each step of the treatment and their responses are observed. Recently, the Reinforcement Learning (RL) paradigm has also been applied to determine DTRs. However, such approaches may be limited by the need to design a true reward function, which may be difficult to formalize when the expert knowledge is not well assessed, as when the DTR is in the design phase. To address this limitation, an extension of the RL paradigm, namely Inverse Reinforcement Learning (IRL), has been adopted to learn the reward function from data, such as those derived from DTR trials. In this paper, we define a Projection Based Inverse Reinforcement Learning (PB-IRL) approach to learn the true underlying reward function for given demonstrations (DTR trials). Such a reward function can be used both to evaluate the set of DTRs determined for a certain disease, as well as to enable an RL-based intelligent agent to self-learn the best way and then act as a decision support system for the clinician.

treatment decision rules. It specifies how the treatments should be adjusted over time according to the dynamic state of the patient [1]. Each rule takes input information about the patient, e.g. medical history, laboratory measurements, demographics, etc., and recommends treatment options that aim to optimize the effectiveness of the treatment program [2]. Predefined DTRs are typically evaluated by adopting the Sequential Multiple Assignment Randomized Trial (SMART) method [3] which is a sequence of observations and treatments. The SMART design requires that a group of patients be randomized initially against the possible treatments and then re-randomized, based on the response of the patient at successive stages of the treatment.
The most relevant goal of a SMART study is the determination of the optimal DTR. Statistical techniques [4] have been adopted to achieve this goal. Patients with similar conditions may have different reactions to given treatments. Some patients respond appropriately, some do not respond at all, and others may have complications(even leading a worse condition). Communities and groups update best practices over time, since there are no ground truths with respect to best treatments [5,6]. This makes the existing / Published online: 21 October 2022 Applied Intelligence (2023) 53:14072-14084 system inexpedient [7]. On the other hand, Reinforcement Learning (RL) techniques learn and update themselves through trial and error. In clinical medicine, RL can be used to assign an optimal regime to patients with similar characteristics. Moreover, RL has been widely investigated, with the aim of identifying an optimal DTR. For example, Q-learning is among the earliest methods used to identify an optimal DTR, which fits linear outcome models in a recursive manner [8]. That is why RL techniques have an edge over existing supervised learning techniques.
If we consider treatments as "actions" and patient responses as "states" then data involved in a SMART study fit naturally into the RL paradigm [9]. RL algorithms [10,11] provide a framework for decision making [12] through learning a "policy" [13]. The policy indicates what action to perform in each situation in order to achieve a certain goal.
The RL model analyzes the "expected reward" of each DTR trial (a sequence of patient responses and recommended treatments) in order to identify the optimal DTR. An optimal DTR is the sequence of treatments with respect to the patient's condition (the patient response) which ultimately result in a positive change in the patient's health. Rewards are the numeric values that represent the effectiveness of the treatment for the current response of the patient. The higher the value of the rewards, the more effective the treatment will be.
However, in many RL problems the reward function may be extremely scattered under this setting and it may be difficult to recognize which actions are useful to obtain the ultimate feedback [14,15]. Clinically guided reward functions can help in identifying compact rewards [16] but this requires expert knowledge which is not available in the SMART method.
On the other hand, the proposed IRL can mitigate these associated problems with RL methods [17,18]. "DTR trials" are the sequences of recommended treatments and patient responses. The proposed IRL aims at learning the reward function for these "DTR trials". The values of the rewards indicate how good or bad these "DTR trials" are. In other words, the values of the rewards indicate how good is the response of the patient in relation to each recommended treatment. Once we have obtained the reward values for the DTRs, we can easily find the optimal DTR (the one which has the highest reward value) for a particular patient.
In this paper, we have proposed an IRL [19] technique to find the best fitting reward function for each "DTR trial". We have considered "DTR trials" as "demonstrations" in the rest of the paper. Given some information about the patient (i.e.., location, cultural level and gender) some demonstrations (sequences of the patient responses and recommended treatments) are generated. Our aim is to find the reward functions for these demonstrations, where the "reward" (as described above) is a signal that gives feedback to the decision-maker, indicating how good the recommended treatment is in terms of the patient response. To achieve this goal, we assume that the reward function can be expressed as a linear function of known "features" and "weight vectors". The weight vector is updated by finding similar policies to the demonstrated policies. A random search for similar policies can be time-consuming, especially in large state spaces. For this reason, we initially searched for random policies and later updated these policies by mixing them with their weights (explained in Section 4).
The results show that the proposed method is a quick and easy way to find the best fitting rewards for demonstrations. Once the true reward function is obtained, it can be used either to identify the optimal DTR or to train an RL-based agent that could act as a Decision Support System for the clinician to treat the patient.
The rest of the paper is organized as follows: Section 2 comprises a quick review of the background and problem formulation, where introductory concepts about the Markov Decision Process (MDP), RL and IRL are presented. We will discuss a case scenario in Section 3. The proposed approach and system model are detailed in Section 4. This section describes the problem of learning the reward function not explicitly, but through observing trajectories or expert demonstrations. A discussion about the experimental model, data description and results is presented in Section 5. The rewards learned from the proposed algorithm are tested on existing RL models in Section 6. Related work is described in Section 7 where a comparison between different methods of Imitation Learning (IL) in DTRs is discussed. We summarize the paper in Section 8.

Reinforcement Learning (RL)
The idea of underlying RL [20] is to interact with the surrounding environment without having any prior knowledge and to learn from these interactions how to achieve a goal. A RL problem is typically described through a MDP. It holds the Markov property, which does not consider past information while taking actions in a current state. An MDP is a tuple (S, A, T , R, γ ) [21] as explained in Table 1.
The goal of RL methods is to learn the behavior of the environment through repetitive interactions. The component that interacts is called the Agent. An agent takes an action a t in a state s t at a time step t and the environment returns the next state s t+1 and reward r t+1 . The aim of the Agent is to learn an optimal policy. The policy (it is a RL terminology) indicates what action to take in a state in order to maximize the cumulative reward. A value function The state which consists of the patient's demographics and clinical variables. a t ∈ A What medication is chosen for a patient at t time step.

T (ś|s, a)
Probability of ending up in stateś by taking action a in state s. T ∈ [0, 1].

R(ś|s, a)
Reward for taking action a in state s. where R ∈ 0 ≤ γ ≤ 1 Discount factor ∈ [0, 1]. A value of zero gives more weight to immediate rewards and a value close to one gives more weight to long term rewards.

π E
The behavior policy of an expert τ E = [τ 1 , τ 2 , τ 3 ...] τ E is the Expert trajectory: where τ i ∈ τ E describes the episode of expert trajectory. V π (s) at a state s is the expected reward by following a policy π [22]: Where r is the reward function, T (s , a , s ) is the transition probability and γ is the discount factor as described in Table 1.
The goal of the RL agent is to find the optimal policy π * (the one with the highest value). The value function V (s) for the optimal policy π * using the Bellman equation is: Equation 2 returns the best possible value V * by exploring all the states S.
In some RL algorithms (e.g. SARSA), the linear functions of features φ i (s, a) are used to find the optimal policy [23].
Where Q(s, a) is the Q − value function at each state (s), φ i (s, a) is the feature function φ : S → [0, 1] k and w is the weight vector that can be updated as: Where η is step size, γ is discount factor , and s , a are the future state and action, respectively. In many applications, it is difficult to specify the reward function. These reward functions are sometimes very sparse and are not accurate. Therefore, to calculate the optimal policy in an uncertain environment, researchers have exploited IRL techniques to find the best fitting reward function against a set of given trajectories or expert demonstrations.

Inverse Reinforcement Learning (IRL)
IRL, also known as Learning from Demonstration (LD) [24], is the process of learning the preferences of an expert (agent) by observing its behavior and avoiding the specification of the reward function. IRL flips the RL problem and attempts to find an underlying reward function that is supposed to explain the expert behavior perfectly. Expert trajectories (sequences of states and actions), also known as expert demonstrations, represent the optimal policy. A policy may be optimal for many different reward functions. Therefore, the goal here is to find the true reward function that is being implicitly optimized by the optimal policy π * . Linear Programming (LP) IRL [25] assumes that the expected value for an expert policy is always higher than that of all other policies. The expectation of the value function (as defined in (1)) is calculated as: This formulation maximizes the smallest difference between the expected value generated by the expert policy and the expected value generated by non-expert policies. The reward function that maximizes this difference is considered to be the true reward function which perfectly explains the expert behavior. These methods examine almost all the policies to find the best one. In a very large state space, it is very difficult and time-consuming to analyze all the actions in all states. To mitigate this problem, PB-IRL has been proposed to allow the agent to quickly explore underlying reward functions for the expert policy. Algorithm 1 Proposed algorithm for the reward estimation. Figure 1 shows DTRs for the management of people with alcohol addiction. The patients are categorized according to the input information (e.g. Location, Cultural level and Gender). Each circle with an R denotes a Randomization stage of the DTR. At each stage R, a new treatment is recommended according to the response of the patient. There are three basic treatments that are possible: Medication therapy (MED), Psychological therapy (PSY), or Telephone Monitoring (TM), plus a combination of these treatments e.g. PSY+MED. Each patient is arbitrarily assigned to one of the available treatments.

Use case scenario
A patient is classified as a Responder if he/she has had fewer than two drinking days during the previous 2 months. Otherwise, the patient is classified as a Non-Responder. Consequently, for example, a patient who is a Non-Responder to PSY treatment is re-randomized to MED

Proposed technique
Let us consider a set of demonstrations or trajectories, τ E as below: A demonstration τ i represents the sequence of expert (physician) decisions. It contains the expert polices π E . In RL, the policy is to map a state to an action. In our model, actions (a i , i ∈ {0, 1, .., d}) represent the set of treatments and states (s t ∈ S) represent the set of patient's response for each recommended treatment.
The goal is to estimate the true underlying reward function for these demonstrations. Rewards are the numerical values which indicate which treatment (action) is most effective against each of the patient's responses (states).
The value V of a policy is the sum of all the discounted rewards collected by following that policy as given in (1). The expected value E[V E (s)] is defined as: where φ represents k known, fixed and bounded basis functions (or feature functions) φ : S → [0, 1] k and w i R are weights. We define an expected accumulated feature value vector as a feature expectation μ E : The (10) can be represented as: Expert feature expectations μ E are calculated by using expert policies π E (given in the demonstrations) while estimated feature expectation μ(π) are calculated by using estimated policiesπ. The idea here is to find a random policyπ and the estimate feature expectations μ(π). If the difference between the expert feature expectation μ E and the estimated feature expectation μ(π) is not small enough to some predefined threshold value then; • update weight vector w and calculate reward function • by using the current reward function apply RL method to update the estimated policy and hence update the estimated feature expectation μ(π).
We are trying to find a reward function that helps RL to estimate a policy which minimizes the difference between μ(π) and μ E . It is computationally complex to estimate such a policy by random searching, especially in a large state space. Alternatively, we can generate a set of random policies {π 1 , . . . .,π d } and mix them with a mixer weight λ i in order to obtain a new policy. The probability of choosinĝ π i is given by λ i . Feature expectation μ of new policy is a convex combination of the feature expectations of these randomly generated policies: According to the Caratheodory's theorem [26], any point that is a convex combination of a set of N points with N > k + 1, can be written as a convex combination of a subset of only k + 1 points.
where C0 denotes the convex hull. Fig. 2 The layout of the proposed model for dynamic treatment regime By doing so we can obtain a set of k + 1 policies which can generate estimated feature expectations that are equally close to the expert feature expectation: we set The weight vector w i defines the values of the underlying reward function through dot multiplication with the feature vector φ. The reward function might be a linear combination of features φ.
Threshold t represents the termination criteria. The algorithm terminates if t obtains less than the predefined parameter .
The dataset of the DTR trials provides the demonstrations τ E and hence expert policies π E . From these demonstrations, it is possible to estimate μ E by following the procedure in (8)- (12).
On the other hand, an iterative process estimates the underlying reward function for these trajectories and updates the estimated policy accordingly to Algorithm 1.
At each iteration, 1) a reward is calculated with respect to the current weight vector (R = i w i φ i ); 2) the RL method (V alue Iteration method) is executed to estimate the policyπ i for current reward; 3) the feature expectation μ is updated andμ i−1 is Computed through (15); and 4) the weight vector is updated by calculating the difference between w i = μ E andμ (i−1) .
This process terminates as the Euclidean distance (||μ E − μ (i−1) || 2 ) becomes less than or equal to some predefined value .

Experiments
In this section, we report the results of an experiment that we have conducted to evaluate the proposed model. First, we will describe the dataset and test. Next, we will present the results and validation.

Model setup
The DTR under examination (Fig. 1) is a two-stage decision process. We completed and mapped such a process into the RL modal as represented in Fig. 3. The model consists of seven states {s 0 , ..., s 6 }. s 0 is the initial state. s 1 and s 3 represents the "responder(Res)" state to "PSY" and "MED" actions respectively. Similarly, s 2 and s 4 represent "Nonresponder" responses of the patients for "PSY" and "MED" actions. While s 5 and s 6 are the terminal states.
Our model has four possible actions(treatments) { MED , P SY , T M , P SY + MED }, and eight probability distributions T s a . Where T is the probability that the patient is 'Responder' being in the state s and selecting an action a.
There are t lhree pre-treatment features (i.e Gender, Location and Cultural level). We have a total of 12 possible combinations of these pre-treatment features(i.e 2 Genders, 2 Locations, 3 Cultural levels = 12) as shown in Table 2.

Dataset description
In order to validate the proposed approach, we have generated a dataset by adopting the SMART method [3,27,28].
The referred DTRs are shown in Fig. 3. In this trial, each participant is randomly assigned to one of two possible initial treatments: psychology (P SY ) or medicine (MED). As already described in a previous section, the participants are classified as Non-Responders (NR) or Responders (Res) to the initial treatment according to whether they do (or do not) experience more than two heavy-drinking days during the two-month time period.  There are 12 possible combinations of pre-treatment features see Table 2. We have selected 500 demonstrations(DTR trials) for each of these combinations. An example of the demonstration for the pre-treatment parameter M-O-L is; "[s 0 , P SY , s 2 , MED, s 5 ]", where s 2 and s 5 represent the "Responder" and "Success" states, respectively (see Fig. 3). These trends show the quick convergence behavior of the algorithm. The final difference value reduces below the threshold in less than 30 iterations; i.e., the last estimated reward values generate feature expectations very close to those generated by the expert policy. On the other hand, the expected rewards at each state are estimated by exercising the proposed model for a given number of demonstrations, as shown in Table 3. These rewards provide the best representation of the demonstration for each pre-treatment feature. Consider "M-O-L" as the pre-treatment features. There were three types of repeated demonstration for "M-O-L" with the following shares: There were some negative demonstrations that ended at state s 6 (the failure state). We only selected the positive demonstrations which end at state s 5 (the success state). The estimated rewards at each state for these demonstrations are presented in Table 3.

Results
Next, we aimed to test how good these reward functions are. To achieve that objective, we used RL (the V alue − I teration) algorithm, with the rewards already estimated through the proposed model. V alues for the different pretreatment features were obtained by RL algorithm and are listed in Table 4. These values represent the preference of the RL agent at each stage. For example, let's consider a patient with the characteristics M − O − L. In stage-1, the value of the treatment MED is greater than the value of the treatment P SY . Therefore, MED is preferred at stage-1. Similarly, at stage-2 there are four possible actions and among these, T M has the highest value. Therefore, T M is the preferred action in stage-2. Similarly, Success has the highest value in the final stage. Hence, the reproduced trajectory through the RL algorithm by using the estimated reward function is: This reproduced trajectory is the same as that given in the demonstrations with the highest percentage share. This means that the estimated reward function is the best explanation for these demonstrations. All the other rows in the Table 4 represent the values of each action for every pre-treatment feature.
In case, the RL-agent selecting MED at stage-1, then all the possible actions it can take at the second stage are P SY , T M and P SY + MED (see Fig. 3). By considering such a case, the aggregated values of each possible combination of actions at both stages are represented in Fig. 5. "MED P SY " meaning that the MED and P SY actions are chosen at stage-1 and stage-2, respectively. The graph shows that MED T M has a higher aggregated value for all the pretreatment features compared with MED P SY and MED P SY + MED.
On the other hand, Fig. 6 represents the aggregated values considering P SY action at stage-1. All the possible actions at stage-2 that the RL agent can take for this case

Testing estimated rewards on existing models
As discussed earlier, it is very hard to find the true rewards for many RL problems, especially in a large state space. In many problems, researchers have to start with random rewards, that are difficult to tune in complex environments. Q-learning [29] and SARSA [30] are two important RL algorithms that are used in health care and personalized medicine. Usually, in existing models, rewards are randomly assigned to each state at the beginning. On the other hand, in the proposed approach, we estimate the best fitting rewards for each state.
In this section, we present a learning comparison of both (Q-learning and SARSA) models, using the existing approach with the estimated rewards via the proposed In the existing technique, rewards are randomly assigned to each state, while in the second case we used the rewards that were learned through the proposed IRL method algorithm. A convergence performance of both RL models on the same DTR problem is presented in Fig. 7.
The learning curves of the Q-learning model, for both cases, are presented in Fig. 7(a). It can be seen that learning from the "IRL rewards" is faster than learning from the "existing approach". With the IRL rewards, almost 415 episodes were needed to learn the DTR environment, while in the other case almost 809 episodes were required.
Similarly, the case of learning the DTR environment by adopting SARSA model is presented in Fig. 7(b). Once again, the learning curve is much faster in the case where the rewards obtained through the proposed algorithm are being used. It took almost 712 episodes to learn the SARSA model when the "existing approach" is used; whereas, it took almost 252 episodes when the "IRL rewards" were used. Both the models performed much better in terms of learning with the rewards that were obtained through the proposed IRL approach.
It is important to note that speeding up the learning process of a RL model is crucial in the case of a DTR. As already mentioned in the previous sections, statistical approaches (e.g., SMART) perform the execution of DTR trials (demonstrations). After the trials are completed, one can train an intelligent agent to get support in decisions making. However, to train a RL model from scratch takes many new "attempts", the number of which can be reduced by using the proposed approach. To conclude, the proposed approach can be effectively adopted: 1. to automatically estimate the "value" of each DTR (which helps in finding the optimal DTR); and 2. to reduce the time taken and the number of attempts needed to train the RL model.

Related work
The application of Machine Learning (ML) techniques in biomedicine [31], minimizing medication errors during home treatment [32,33], risk management [34], communication [35][36][37] and healthcare [38][39][40] has been increased extensively in recent years. DTRs [41,42] oversimplify personalized medicine to time-varying treatment settings in which the treatment is frequently tailored to a patient's dynamic-state. DTRs are alternatively known as adaptive treatment strategies [43][44][45] or treatment policies [46,47]. Behavior Cloning (BC) and RL [48,49] are two methods exploited to learn DTRs. BC [50] learns the policy through supervised learning by the direct mapping of states to actions. It can avoid interacting with the environment. However, without considerable improvement during the training, BC introduces a compounding error [51] over the length of the trajectories. BC can effectively recover the doctor's policies when the Electronic Health Record (EHR) is optimal and plentiful. However, these methods are not suitable if the ground truth of the treatment is unclear [7].
On the other hand, the RL [52] and Deep Reinforcement Learning (DRL) [53] methods are based on maximizing the long-term reward of the patients [14,48] from directly learning a policy. However, the learned policy is highly reliant on the accuracy of the pre-defined reward function. The true reward function is very crucial in RL in terms of finding the optimal DTR. In addition, these models are not explainable to an expert domain, due to the lack of true reward function, which is important in this sensitive application.

Conclusions and future directions
In this paper, we have proposed a PB-IRL approach to learn the underlying reward function for the automatic analysis of the DTRs from their trials (demonstrations). The reward function explains the behavior of experts (physicians) more effectively in comparison with RL policies. We have also shown that a RL-based decision support system can be directly and efficiently trained by adopting this reward function.
The reward function for demonstrations is estimated by learning the appropriate policies. Such policies can easily be found by mixing all the estimated policies corresponding to their mixture weights (as explained in Section 4). In a large state space, it is an easy and rapid way to obtain an optimal policy and hence the true reward function for the given demonstrations. In order to validate the approach, we have generated some demonstrations, applied the proposed algorithm, and estimated the reward function for these demonstrations. In order to check how good these estimated rewards are, we have exercised the V alue iteration RL algorithm and learned the policies. By comparing these learned policies with the demonstrated policies we have proved the accuracy of the algorithm. The results show that the proposed algorithm provides better dynamic treatment regimes and helps the existing models to learn the environment fast.
The proposed IRL technique mitigates the problems associated with the RL, but there are some limitations to be aware of. IRL methods only consider the positive demonstrations or trajectories (i.e., those that result in the recovery or the survival of the patients) and learn a policy to recover these trajectories. The information embedded in negative trajectories (e.g. those that result in unsuccessful treatments) has been largely ignored. However, it could potentially help the learned policy to avoid repeating mistakes. Both positive and negative trajectories can be used to deduce the best practice and avoid errors in dynamic treatment regimes. These techniques, therefore, may increase the likelihood of patient survival through the exploitation of information from both positive and negative trajectories and provide better dynamic treatment regime. Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.