1 Introduction

A dynamic treatment regime (DTR), also known as an adaptive treatment strategy, is a sequence of tailored treatment decision rules. It specifies how treatments should be adjusted over time according to the dynamic state of the patient [1]. Each rule takes input information about the patient (e.g., medical history, laboratory measurements, demographics) and recommends treatment options that aim to optimize the effectiveness of the treatment program [2]. Predefined DTRs are typically evaluated by adopting the Sequential Multiple Assignment Randomized Trial (SMART) method [3], which consists of a sequence of observations and treatments. The SMART design requires that a group of patients be randomized initially among the possible treatments and then re-randomized, based on the response of the patient, at successive stages of the treatment.

The most relevant goal of a SMART study is the determination of the optimal DTR. Statistical techniques [4] have been adopted to achieve this goal. However, patients with similar conditions may react differently to a given treatment: some respond appropriately, some do not respond at all, and others may develop complications (even leading to a worse condition). Communities and groups update best practices over time, since there are no ground truths with respect to the best treatments [5, 6]. This makes the existing system inexpedient [7]. On the other hand, Reinforcement Learning (RL) techniques learn and update themselves through trial and error. In clinical medicine, RL can be used to assign an optimal regime to patients with similar characteristics. Moreover, RL has been widely investigated with the aim of identifying an optimal DTR; for example, Q-learning, which fits linear outcome models in a recursive manner, is among the earliest methods used to identify an optimal DTR [8]. This is why RL techniques have an edge over existing supervised learning techniques.

If we consider treatments as “actions” and patient responses as “states”, then the data involved in a SMART study fit naturally into the RL paradigm [9]. RL algorithms [10, 11] provide a framework for decision making [12] through learning a “policy” [13]. The policy indicates what action to perform in each situation in order to achieve a certain goal.

The RL model analyzes the “expected reward” of each DTR trial (a sequence of patient responses and recommended treatments) in order to identify the optimal DTR. An optimal DTR is the sequence of treatments, selected with respect to the patient’s condition (the patient response), which ultimately results in a positive change in the patient’s health. Rewards are numeric values that represent the effectiveness of a treatment for the current response of the patient: the higher the reward, the more effective the treatment.

However, in many RL problems the reward function may be extremely sparse under this setting, and it may be difficult to recognize which actions are useful to obtain the ultimate feedback [14, 15]. Clinically guided reward functions can help in identifying compact rewards [16], but this requires expert knowledge which is not available in the SMART method.

On the other hand, the proposed IRL approach can mitigate these problems associated with RL methods [17, 18]. “DTR trials” are sequences of recommended treatments and patient responses. The proposed IRL approach aims at learning the reward function for these “DTR trials”. The reward values indicate how good or bad these “DTR trials” are; in other words, they indicate how good the response of the patient is in relation to each recommended treatment. Once we have obtained the reward values for the DTRs, we can easily find the optimal DTR (the one with the highest reward value) for a particular patient.

In this paper, we propose an IRL [19] technique to find the best fitting reward function for each “DTR trial”. We refer to “DTR trials” as “demonstrations” in the rest of the paper. Given some information about the patient (i.e., location, cultural level and gender), some demonstrations (sequences of patient responses and recommended treatments) are generated. Our aim is to find the reward functions for these demonstrations, where the “reward” (as described above) is a signal that gives feedback to the decision-maker, indicating how good the recommended treatment is in terms of the patient response. To achieve this goal, we assume that the reward function can be expressed as a linear function of known “features” and “weight vectors”. The weight vector is updated by finding policies similar to the demonstrated policies. A random search for similar policies can be time-consuming, especially in large state spaces. For this reason, we initially search for random policies and later update these policies by mixing them with their weights (explained in Section 4).

The results show that the proposed method is a quick and easy way to find the best fitting rewards for demonstrations. Once the true reward function is obtained, it can be used either to identify the optimal DTR or to train an RL-based agent that could act as a Decision Support System for the clinician to treat the patient.

The rest of the paper is organized as follows: Section 2 comprises a quick review of the background and problem formulation, where introductory concepts about the Markov Decision Process (MDP), RL and IRL are presented. We will discuss a case scenario in Section 3. The proposed approach and system model are detailed in Section 4. This section describes the problem of learning the reward function not explicitly, but through observing trajectories or expert demonstrations. A discussion about the experimental model, data description and results is presented in Section 5. The rewards learned from the proposed algorithm are tested on existing RL models in Section 6. Related work is described in Section 7 where a comparison between different methods of Imitation Learning (IL) in DTRs is discussed. We summarize the paper in Section 8.

2 Preliminaries

2.1 Reinforcement Learning (RL)

The idea underlying RL [20] is to interact with the surrounding environment without any prior knowledge and to learn from these interactions how to achieve a goal. An RL problem is typically described through an MDP, which satisfies the Markov property: the action taken in the current state does not depend on past information. An MDP is a tuple (S, A, T, R, γ) [21], as explained in Table 1.

Table 1 Description of related notations

The goal of RL methods is to learn the behavior of the environment through repetitive interactions. The component that interacts is called the agent. The agent takes an action \(a_t\) in a state \(s_t\) at time step t, and the environment returns the next state \(s_{t+1}\) and a reward \(r_{t+1}\). The aim of the agent is to learn an optimal policy. The policy (an RL term) indicates what action to take in each state in order to maximize the cumulative reward. The value function Vπ(s) at a state s is the expected reward obtained by following a policy π [22]:

$$ V^{\pi}(s) = \sum\limits_{a \in A} \pi(a \vert s) \sum\limits_{s^{\prime} \in S} T(s, a, s^{\prime}) \left[ r(s,a) + \gamma V^{\pi}(s^{\prime}) \right] $$
(1)

where r is the reward function, \(T(s, a, s^{\prime})\) is the transition probability and γ is the discount factor, as described in Table 1.

The goal of the RL agent is to find the optimal policy π* (the one with the highest value). The value function V*(s) for the optimal policy π*, given by the Bellman optimality equation, is:

$$ V^{*}(s) = \max_{a} \sum\limits_{s^{\prime} \in S} T(s, a, s^{\prime}) \left[ r(s,a) + \gamma V^{*}(s^{\prime}) \right] $$
(2)

Equation (2) returns the best possible value V*(s) and is applied over all states s ∈ S.
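Equation (2) is typically solved by sweeping the Bellman backup over all states until the values stop changing. The following Python sketch is our own illustration, not code from the paper; the array shapes and parameter names are assumptions.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Minimal value-iteration sketch for Eq. (2).

    T : array of shape (S, A, S), transition probabilities T(s, a, s').
    R : array of shape (S, A), reward r(s, a).
    Returns the optimal state values V*(s) and a greedy policy.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)          # Bellman optimality backup, Eq. (2)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)
```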

In some RL algorithms (e.g., SARSA), linear functions of features ϕi(s,a) are used to approximate the Q-value function and hence to find the optimal policy [23].

$$ Q_{w} (s,a)=\sum\limits_{i} w_{i} \phi_{i} (s,a) $$
(3)

where Qw(s,a) is the Q-value function for each state-action pair (s,a), ϕi(s,a) is the feature function ϕ : S × A → [0,1]k and w is the weight vector, which can be updated as:

$$ \begin{array}{@{}rcl@{}} w_{i} &\leftarrow& w_{i} + \eta \, \delta \, \phi_{i}(s,a) \\ \delta &=& r + \gamma Q_{w}(s^{\prime}, a^{\prime}) - Q_{w}(s,a) \end{array} $$
(4)

where η is the step size, γ is the discount factor, and \(s^{\prime }\), \(a^{\prime }\) are the next state and action, respectively. In many applications it is difficult to specify the reward function: reward functions are sometimes very sparse and inaccurate. Therefore, to calculate the optimal policy in an uncertain environment, researchers have exploited IRL techniques to find the best fitting reward function for a set of given trajectories or expert demonstrations.
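As a concrete illustration of (3)-(4), the sketch below is our own; the names `phi`, `eta` and `gamma` are hypothetical parameters. It performs one SARSA weight update under a linear Q-function.

```python
import numpy as np

def sarsa_linear_update(w, phi, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    """One SARSA update with linear function approximation, Eqs. (3)-(4).

    w   : weight vector of length k.
    phi : callable phi(s, a) -> feature vector of length k in [0, 1]^k.
    """
    q_sa = w @ phi(s, a)                  # Q_w(s, a) = sum_i w_i * phi_i(s, a)
    q_next = w @ phi(s_next, a_next)      # Q_w(s', a')
    delta = r + gamma * q_next - q_sa     # TD error, Eq. (4)
    return w + eta * delta * phi(s, a)    # w_i <- w_i + eta * delta * phi_i(s, a)
```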

2.2 Inverse Reinforcement Learning (IRL)

IRL, also known as Learning from Demonstration (LD) [24], is the process of learning the preferences of an expert (agent) by observing its behavior, thereby avoiding the explicit specification of the reward function. IRL flips the RL problem and attempts to find an underlying reward function that explains the expert behavior. Expert trajectories (sequences of states and actions), also known as expert demonstrations, represent the optimal policy. A policy may be optimal for many different reward functions; therefore, the goal is to find the true reward function that is implicitly optimized by the optimal policy π*.

Algorithm 1 Proposed algorithm for the reward estimation

Linear Programming (LP) IRL [25] assumes that the expected value for an expert policy is always higher than that of all other policies. The expectation of the value function (as defined in (1)) is calculated as:

$$ \forall s \in S, \ \forall a \in A: \quad \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, \pi(s))}\left[ V^{*}(s^{\prime}) \right] \geq \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, a)}\left[ V^{\pi}(s^{\prime}) \right] $$
(5)
$$ \max \sum\limits_{s \in S_{0}} \min_{a \in A \setminus \{\pi(s)\}} \left\{ \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, \pi(s))}\left[ V^{*}(s^{\prime}) \right] - \mathbb{E}_{s^{\prime} \sim T(s^{\prime} \vert s, a)}\left[ V^{\pi}(s^{\prime}) \right] \right\} $$
(6)

This formulation maximizes the smallest difference between the expected value generated by the expert policy and the expected value generated by non-expert policies. The reward function that maximizes this difference is considered to be the true reward function, which best explains the expert behavior. These methods examine almost all policies to find the best one; in a very large state space, it is difficult and time-consuming to analyze all the actions in all states. To mitigate this problem, PB-IRL (the approach proposed in this paper) allows the agent to quickly explore the underlying reward function for the expert policy.
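A hedged sketch of how the max-min objective in (5)-(6) can be evaluated for a candidate reward is given below. The names (`T`, `V`, `expert_policy`) and the summation over all states (instead of only S0) are our own simplifications; a full LP-IRL solver would optimize the reward subject to these constraints rather than merely evaluate them.

```python
import numpy as np

def lp_irl_margin(T, V, expert_policy):
    """Evaluate the max-min objective of Eqs. (5)-(6) for a candidate reward.

    T             : array (S, A, S), transition probabilities.
    V             : array (S,), values of the expert policy under the
                    candidate reward (e.g. obtained by policy evaluation).
    expert_policy : array (S,), the expert action pi(s) in each state.
    Returns the sum over states of the smallest value margin between the
    expert action and any other action.
    """
    n_states, n_actions, _ = T.shape
    total = 0.0
    for s in range(n_states):
        expert_value = T[s, expert_policy[s]] @ V           # E_{s'~T(.|s,pi(s))}[V(s')]
        others = [T[s, a] @ V
                  for a in range(n_actions) if a != expert_policy[s]]
        total += expert_value - max(others)                 # smallest margin in state s
    return total
```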

3 Use case scenario

Figure 1 shows DTRs for the management of people with alcohol addiction. The patients are categorized according to the input information (e.g., Location, Cultural level and Gender). Each circle with an R denotes a randomization stage of the DTR. At each stage R, a new treatment is recommended according to the response of the patient. Three basic treatments are possible: Medication therapy (MED), Psychological therapy (PSY), or Telephone Monitoring (TM), plus combinations of these treatments, e.g., PSY+MED. Each patient is randomly assigned to one of the available treatments.

Fig. 1 A dynamic treatment regime

A patient is classified as a Responder if he/she has had fewer than two drinking days during the previous two months; otherwise, the patient is classified as a Non-Responder. Consequently, for example, a patient who is a Non-Responder to the PSY treatment is re-randomized to the MED treatment or to a combination of PSY and MED (PSY+MED). Participants who are classified as Responders to the initial treatment are instead monitored via TM for an additional period.

4 Proposed technique

Let us consider a set of demonstrations or trajectories, τE as below:

$$ \begin{array}{@{}rcl@{}} \tau_{E} &=& \left[({s^{1}_{1}},{a^{1}_{1}},{s^{1}_{2}},{a^{1}_{2}} , ..,{s^{1}_{d}},{a^{1}_{d}} ),\right. \\ && \left. ({s^{2}_{1}},{a^{2}_{1}},{s^{2}_{2}},{a^{2}_{2}} , ..,{s^{2}_{d}},{a^{2}_{d}}), ...\right] \\ &=& \left[\tau^{1},\tau^{2},\tau^{3} ... \right] \end{array} $$
(7)

A demonstration τi represents a sequence of expert (physician) decisions and thus contains the expert policies πE. In RL, a policy maps states to actions. In our model, the actions (ai, i ∈ {0,1,..,d}) represent the set of treatments and the states (st ∈ S) represent the patient’s responses to each recommended treatment.

The goal is to estimate the true underlying reward function for these demonstrations. Rewards are the numerical values which indicate which treatment (action) is most effective against each of the patient’s responses (states).

The value V of a policy is the sum of all the discounted rewards collected by following that policy as given in (1). The expected value E[VE(s)] is defined as:

$$ E[V_{E} (s)]=E[ \sum\limits^{\infty}_{t=1} \gamma^{t} R(s_{t}) \vert {\pi_{E}}] $$
(8)
$$ =E[ \sum\limits^{\infty}_{t=1} \gamma^{t} w.\phi(s_{t}) \vert {\pi_{E}}] $$
(9)
$$ =w . E[ \sum\limits^{\infty}_{t=1} \gamma^{t} \phi(s_{t}) \vert {\pi_{E}}] $$
(10)

where ϕ represents k known, fixed and bounded basis functions (or feature functions) ϕ : S → [0,1]k and \(w_{i} \epsilon \mathbb {R}\) are weights. We define an expected accumulated feature value vector as a feature expectation μE:

$$ \mu_{E} = E[ \sum\limits^{\infty}_{t=1} \gamma^{t} \phi(s_{t}) \vert {\pi_{E}}] $$
(11)

The expected value E[VE(s)] in (10) can be represented as:

$$ E[V_{E}(s)] = w . \mu_{E} $$
(12)

Expert feature expectations μE are calculated using the expert policies πE (given in the demonstrations), while the estimated feature expectations \(\mu (\hat {\pi })\) are calculated using the estimated policies \(\hat {\pi }\).
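In practice, μE can be estimated empirically by averaging the discounted feature counts over the demonstrations, as in (11). The sketch below is ours; it assumes each trajectory is given as a list of (state, action) pairs and that `phi` maps a state to its feature vector.

```python
import numpy as np

def feature_expectation(demonstrations, phi, gamma=0.9):
    """Empirical estimate of mu_E in Eq. (11) from expert demonstrations.

    demonstrations : list of trajectories, each a list of (state, action) pairs.
    phi            : callable phi(s) -> feature vector of length k.
    """
    k = len(phi(demonstrations[0][0][0]))
    mu = np.zeros(k)
    for trajectory in demonstrations:
        # discounting starts at t = 1, matching the sum in Eq. (11)
        for t, (s, a) in enumerate(trajectory, start=1):
            mu += (gamma ** t) * phi(s)        # discounted, accumulated features
    return mu / len(demonstrations)            # average over demonstrations
```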

The idea is to find a random policy \(\hat {\pi }\) and its estimated feature expectation \(\mu (\hat {\pi })\). If the difference between the expert feature expectation μE and the estimated feature expectation \(\mu (\hat {\pi })\) is not smaller than some predefined threshold value 𝜖, then:

  • update the weight vector w and calculate the reward function (\(R={\sum }_{i} w_{i} \phi _{i}\));

  • using the current reward function, apply an RL method to update the estimated policy and hence the estimated feature expectation \(\mu (\hat {\pi })\).

We are trying to find a reward function that helps RL to estimate a policy which minimizes the difference between \(\mu (\hat {\pi })\) and μE. It is computationally expensive to estimate such a policy by random search, especially in a large state space.

Alternatively, we can generate a set of random policies \(\{\hat {\pi }^{1} ,\ldots, \hat {\pi }^{d} \}\) and mix them with mixing weights λi in order to obtain a new policy, where the probability of choosing \(\hat {\pi }^{i}\) is given by λi. The feature expectation μ of the new policy is a convex combination of the feature expectations of these randomly generated policies:

$$ {\mu} = \sum\limits_{i} \lambda_{i} {\mu}^{(i)} ; \lambda_{i} \geq 0, \sum\limits_{i} \lambda_{i} = 1. $$
(13)

According to Carathéodory’s theorem [26], any point that is a convex combination of a set of N points, with N > k + 1, can be written as a convex combination of a subset of only k + 1 points.

$$ \mu = \arg \min_{\mu \in Co\{\mu(\hat{\pi}^{i})\}^{n}_{i=0}} \left\| \mu_{E} - \mu \right\|_{2} $$
(14)

where Co denotes the convex hull.

By doing so, we can obtain a set of k + 1 policies which generate estimated feature expectations that are equally close to the expert feature expectation:

$$ \hat{\mu}^{i-1} = \hat{\mu}^{i-2} + \frac{(\mu^{i-1} - \hat{\mu}^{i-2})^{T}(\mu_{E} - \hat{\mu}^{i-2})}{(\mu^{i-1} - \hat{\mu}^{i-2})^{T}(\mu^{i-1} - \hat{\mu}^{i-2})} \left( \mu^{i-1} - \hat{\mu}^{i-2} \right) $$
(15)

we set

$$ w^{i}=\mu_{E} - \hat{\mu}^{(i-1)} $$
(16)

The weight vector wi defines the values of the underlying reward function through a dot product with the feature vector ϕ. The reward function is assumed to be a linear combination of the features ϕ:

$$ \begin{array}{@{}rcl@{}} R(s,a)&=& w_{1} \phi_{1} (s,a)+w_{2} \phi_{2} (s,a)+ \ldots{\ldots} +w_{k} \phi_{k} (s,a) \\ R(s,a)&=&\sum\limits_{i} \{w_{i} \phi_{i} (s,a) \} \end{array} $$
(17)

The threshold t represents the termination criterion: the algorithm terminates when t becomes smaller than the predefined parameter 𝜖.

$$ t= \left\|\mu_{E} - \hat{\mu}^{(i-1)}\right\|_{2} \leq \epsilon $$
(18)

Initially, we set \(w^{1} = \mu _{E} - \hat {\mu }^{(0)}\), where \(\hat {\mu }^{(0)} = \mu ^{(0)}\). The workflow is shown in Fig. 2.

Fig. 2 The layout of the proposed model for the dynamic treatment regime

The dataset of the DTR trials provides the demonstrations τE and hence the expert policies πE. From these demonstrations, μE can be estimated by following the procedure in (8)-(12).

In parallel, an iterative process estimates the underlying reward function for these trajectories and updates the estimated policy according to Algorithm 1.

At each iteration: 1) a reward is calculated with respect to the current weight vector (\(R={\sum }_{i} w_{i} \phi _{i}\)); 2) an RL method (Value Iteration) is executed to estimate the policy \({\hat {\pi }}^{i}\) for the current reward; 3) the feature expectation μ is updated and \(\hat {\mu }^{i-1}\) is computed through (15); and 4) the weight vector is updated as \(w^{i} = \mu_{E} - \hat {\mu }^{(i-1)}\).

The process terminates when the Euclidean distance (\(||\mu _{E} - \hat {\mu }^{(i-1)}||_{2} \)) becomes less than or equal to the predefined value 𝜖.
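The following sketch ties steps 1)-4) together in code. It is a minimal illustration of the projection-based loop under our own assumptions: `solve_rl` is a hypothetical helper that runs an RL solver (e.g. value iteration) on a reward vector and returns the feature expectation of the resulting policy, and `phi` is a state-feature matrix.

```python
import numpy as np

def projection_irl(mu_E, phi, solve_rl, epsilon=1e-3, max_iter=100):
    """Sketch of the projection-based reward-estimation loop, Eqs. (15)-(18).

    mu_E     : expert feature expectation of length k (from Eq. (11)).
    phi      : feature matrix of shape (S, k); row s holds phi(s).
    solve_rl : hypothetical callable mapping a reward vector of length S to
               the feature expectation mu of the policy an RL solver returns.
    """
    # mu_hat^(0) = mu^(0): feature expectation of an initial random policy
    mu_bar = solve_rl(np.random.randn(phi.shape[0]))
    w = mu_E - mu_bar                                     # w^1, Eq. (16)
    reward = phi @ w                                      # R(s) = sum_i w_i * phi_i(s)
    for _ in range(max_iter):
        if np.linalg.norm(mu_E - mu_bar) <= epsilon:      # termination, Eq. (18)
            break
        mu_i = solve_rl(reward)                           # RL step for current reward
        d = mu_i - mu_bar
        mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / (d @ d) * d   # projection, Eq. (15)
        w = mu_E - mu_bar                                 # weight update, Eq. (16)
        reward = phi @ w                                  # updated reward function
    return w, reward
```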

5 Experiments

In this section, we report the results of an experiment that we have conducted to evaluate the proposed model.

First, we describe the dataset and the test setup; next, we present the results and their validation.

5.1 Model setup

The DTR under examination (Fig. 1) is a two-stage decision process. We mapped this process into the RL model represented in Fig. 3. The model consists of seven states {s0,...,s6}: s0 is the initial state; s1 and s3 represent the “Responder (Res)” states for the “PSY” and “MED” actions, respectively; similarly, s2 and s4 represent the “Non-Responder” responses of the patients to the “PSY” and “MED” actions; and s5 and s6 are the terminal states.

Fig. 3 Mapping the DTR trials as an RL model

Our model has four possible actions (treatments) {MED, PSY, TM, PSY+MED} and eight probability distributions \(T_{s,a}\), where \(T_{s,a}\) is the probability that the patient is a ‘Responder’ when being in state s and selecting action a.

There are three pre-treatment features (i.e., Gender, Location and Cultural level). We have a total of 12 possible combinations of these pre-treatment features (i.e., 2 Genders × 2 Locations × 3 Cultural levels = 12), as shown in Table 2. These pre-treatment features represent the information about the patient. As an example, [MDL] stands for Gender = Male, Location = Downtown, and Cultural Level = Low.

Table 2 Features of the (SMART) design schematic for DTR

The model is further divided into three stages: stage-1 comprises s0, stage-2 comprises {s1, s2, s3 and s4}, and the final stage contains the states {s5 and s6}. Two actions (PSY and MED) are possible at stage-1, and four actions (PSY, MED, TM, PSY+MED) are possible at stage-2, as shown in Table 2. The actions (treatments) selected at stage-1 depend on the pre-treatment features, whereas the actions at stage-2 depend on the response of the patient (i.e., Responder or Non-Responder). The final stage contains only the terminal states s5 and s6, which represent the outcomes of the treatment (success or failure).
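For concreteness, a minimal sketch of one possible encoding of this model is shown below. The state/action labels follow Fig. 3, while the single-letter feature codes and all transition probabilities are placeholders of our own, not the values used in the experiments.

```python
import numpy as np
from itertools import product

# 12 pre-treatment feature combinations (2 Genders x 2 Locations x 3 Cultural levels);
# the single-letter codes are illustrative placeholders.
pretreatment = list(product(["M", "F"], ["D", "O"], ["L", "M", "H"]))
assert len(pretreatment) == 12

states = ["s0",            # initial state
          "s1", "s2",      # Responder / Non-Responder to PSY
          "s3", "s4",      # Responder / Non-Responder to MED
          "s5", "s6"]      # Success / Failure (terminal)
actions = ["MED", "PSY", "TM", "PSY+MED"]

# T[s, a, s'] transition probabilities; the numbers below are placeholders only.
S, A = len(states), len(actions)
T = np.zeros((S, A, S))
T[0, actions.index("PSY"), [1, 2]] = [0.6, 0.4]        # stage-1: PSY
T[0, actions.index("MED"), [3, 4]] = [0.5, 0.5]        # stage-1: MED
T[1, actions.index("TM"), 5] = 1.0                     # responders monitored via TM
T[3, actions.index("TM"), 5] = 1.0
T[2, actions.index("MED"), [5, 6]] = [0.7, 0.3]        # NR to PSY switched to MED ...
T[2, actions.index("PSY+MED"), [5, 6]] = [0.8, 0.2]    # ... or to PSY+MED
T[4, actions.index("PSY"), [5, 6]] = [0.6, 0.4]        # NR to MED switched to PSY ...
T[4, actions.index("PSY+MED"), [5, 6]] = [0.7, 0.3]    # ... or to PSY+MED
```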

5.2 Dataset description

In order to validate the proposed approach, we have generated a dataset by adopting the SMART method [3, 27, 28].

The adopted DTRs are shown in Fig. 3. In this trial, each participant is randomly assigned to one of two possible initial treatments: psychological therapy (PSY) or medication therapy (MED). As described in the previous section, participants are classified as Non-Responders (NR) or Responders (Res) to the initial treatment according to whether or not they experience more than two heavy-drinking days during the two-month period.

There are 12 possible combinations of pre-treatment features (see Table 2). We have selected 500 demonstrations (DTR trials) for each of these combinations. An example of a demonstration for the pre-treatment parameters M-O-L is “[s0, PSY, s2, MED, s5]”, where s2 and s5 represent the “Non-Responder” and “Success” states, respectively (see Fig. 3).

5.3 Results

Figure 4 shows the difference between the expert feature expectations and the estimated feature expectations as a function of the number of iterations, as defined in (18). The results have been obtained with a discount factor γ = 0.01. As an example, we considered the demonstrations of only three pre-treatment parameters (M-D-M, M-D-L, and M-D-H) to plot this graph. The curves show that we are on the right track in terms of learning the real reward function since, with each iteration, the difference between the expert feature expectations and the estimated feature expectations gets smaller.

Fig. 4 Values of the threshold at different numbers of iterations

These trends show the quick convergence behavior of the algorithm: the difference falls below the threshold in fewer than 30 iterations, i.e., the final estimated reward values generate feature expectations very close to those generated by the expert policy.

On the other hand, the expected rewards at each state are estimated by applying the proposed model to a given number of demonstrations, as shown in Table 3. These rewards provide the best representation of the demonstrations for each pre-treatment feature. Consider “M-O-L” as the pre-treatment features; there were three types of repeated demonstrations for “M-O-L”, with the following shares:

Table 3 Rewards collected at each state against demonstrations

[s0, MED, s4, PSY, s5] = 32 %

[s0, MED, s3, TM, s5] = 37 %

[s0, MED, s4, PSY +MED, s5] = 31 %

There were also some negative demonstrations, which ended at state s6 (the failure state). We selected only the positive demonstrations, which end at state s5 (the success state). The estimated rewards at each state for these demonstrations are presented in Table 3.

Next, we tested how good these reward functions are. To this end, we ran the RL (Value Iteration) algorithm with the rewards already estimated through the proposed model. The values for the different pre-treatment features obtained by the RL algorithm are listed in Table 4; they represent the preference of the RL agent at each stage. For example, consider a patient with the characteristics M-O-L. At stage-1, the value of the treatment MED is greater than the value of the treatment PSY; therefore, MED is preferred at stage-1. Similarly, at stage-2 there are four possible actions and, among these, TM has the highest value, so TM is the preferred action at stage-2. Finally, Success has the highest value in the final stage. Hence, the trajectory reproduced by the RL algorithm using the estimated reward function is:

$$ [s_{0}, \text{MED}, s_{3}, \text{TM}, s_{5}] $$

This reproduced trajectory is the same as the demonstration with the highest percentage share, which means that the estimated reward function is the best explanation for these demonstrations. All the other rows in Table 4 represent the values of each action for every pre-treatment feature.
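The stage-wise preference described above amounts to a greedy read-out of the value table. The toy sketch below uses illustrative numbers (not the values reported in Table 4) to show how the preferred trajectory is assembled.

```python
# Hypothetical values for patient "M-O-L" (illustrative numbers only)
stage1_values = {"PSY": 0.41, "MED": 0.63}
stage2_values = {"PSY": 0.22, "MED": 0.18, "TM": 0.57, "PSY+MED": 0.31}
final_values = {"Success": 0.95, "Failure": 0.05}

# The reproduced trajectory picks the highest-valued option at each stage
trajectory = [max(v, key=v.get) for v in (stage1_values, stage2_values, final_values)]
print(trajectory)   # ['MED', 'TM', 'Success']  ->  [s0, MED, s3, TM, s5]
```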

Table 4 Values at each state collected by the Value Iteration algorithm

If the RL agent selects MED at stage-1, then the possible actions it can take at the second stage are PSY, TM and PSY+MED (see Fig. 3). For this case, the aggregated values of each possible combination of actions at both stages are represented in Fig. 5, where “MED PSY” means that the MED and PSY actions are chosen at stage-1 and stage-2, respectively. The graph shows that MED TM has a higher aggregated value for all the pre-treatment features compared with MED PSY and MED PSY+MED.

Fig. 5 Aggregated values considering MED as the action at Stage-1

On the other hand, Fig. 6 shows the aggregated values when PSY is the action at stage-1. The possible actions at stage-2 that the RL agent can take in this case are MED, TM and PSY+MED. The graph shows that the PSY and TM actions, at stage-1 and stage-2 respectively, provide the highest aggregated values for all the pre-treatment features. Note that the trends in Figs. 5 and 6 represent the policies of the RL algorithm.

Fig. 6 Aggregated values considering PSY as the action at Stage-1

6 Testing estimated rewards on existing models

As discussed earlier, it is very hard to find the true rewards for many RL problems, especially in a large state space. In many problems, researchers have to start with random rewards, which are difficult to tune in complex environments. Q-learning [29] and SARSA [30] are two important RL algorithms used in health care and personalized medicine. In existing models, rewards are usually assigned randomly to each state at the beginning; in the proposed approach, by contrast, we estimate the best fitting reward for each state.

In this section, we present a learning comparison of both models (Q-learning and SARSA), using the existing approach and the rewards estimated via the proposed algorithm. The convergence performance of both RL models on the same DTR problem is presented in Fig. 7.

Fig. 7 Learning curves of the RL models using both the existing and the proposed techniques. In the existing technique, rewards are randomly assigned to each state, while in the second case we used the rewards learned through the proposed IRL method

The learning curves of the Q-learning model for both cases are presented in Fig. 7(a). It can be seen that learning from the “IRL rewards” is faster than learning with the “existing approach”: with the IRL rewards, about 415 episodes were needed to learn the DTR environment, while in the other case about 809 episodes were required.

Similarly, the case of learning the DTR environment with the SARSA model is presented in Fig. 7(b). Once again, learning is much faster when the rewards obtained through the proposed algorithm are used: it took about 712 episodes to learn the SARSA model with the “existing approach”, whereas it took about 252 episodes with the “IRL rewards”. Both models performed much better in terms of learning with the rewards obtained through the proposed IRL approach.
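To make the comparison setup explicit, the sketch below shows how tabular Q-learning can be run twice on the same environment, once with random rewards (the existing approach) and once with the IRL-estimated rewards. The `env` interface (`n_states`, `n_actions`, `reset()`, `step()`) is a hypothetical assumption for illustration.

```python
import numpy as np

def q_learning(env, rewards, n_episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning on the DTR environment with a given reward table.

    `env` is assumed to expose reset() -> state and step(state, action) ->
    (next_state, done); `rewards` is an (S, A) table, either random (the
    existing approach) or estimated by the proposed IRL method.
    """
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = np.random.randint(env.n_actions) if np.random.rand() < eps else Q[s].argmax()
            s_next, done = env.step(s, a)
            target = rewards[s, a] + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Comparison: same environment, two reward tables
# Q_random = q_learning(env, np.random.rand(env.n_states, env.n_actions))
# Q_irl    = q_learning(env, irl_rewards)   # rewards from the proposed approach
```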

It is important to note that speeding up the learning process of an RL model is crucial in the case of a DTR. As mentioned in the previous sections, statistical approaches (e.g., SMART) require the execution of DTR trials (demonstrations); after the trials are completed, one can train an intelligent agent to get support in decision making. However, training an RL model from scratch takes many new “attempts”, the number of which can be reduced by using the proposed approach. To conclude, the proposed approach can be effectively adopted:

  1. to automatically estimate the “value” of each DTR (which helps in finding the optimal DTR); and

  2. to reduce the time taken and the number of attempts needed to train the RL model.

7 Related work

The application of Machine Learning (ML) techniques in biomedicine [31], minimizing medication errors during home treatment [32, 33], risk management [34], communication [35,36,37] and healthcare [38,39,40] has increased extensively in recent years. DTRs [41, 42] generalize personalized medicine to time-varying treatment settings in which the treatment is repeatedly tailored to a patient’s dynamic state. DTRs are alternatively known as adaptive treatment strategies [43,44,45] or treatment policies [46, 47]. Behavior Cloning (BC) and RL [48, 49] are two methods exploited to learn DTRs.

BC [50] learns the policy through supervised learning by directly mapping states to actions, thereby avoiding interaction with the environment. However, without considerable improvement during training, BC introduces a compounding error [51] over the length of the trajectories. BC can effectively recover the doctor’s policies when the Electronic Health Record (EHR) data are optimal and plentiful; however, these methods are not suitable if the ground truth of the treatment is unclear [7].

On the other hand, RL [52] and Deep Reinforcement Learning (DRL) [53] methods are based on maximizing the long-term reward of the patients [14, 48] by directly learning a policy. However, the learned policy is highly reliant on the accuracy of the pre-defined reward function; the true reward function is crucial for RL to find the optimal DTR. In addition, due to the lack of a true reward function, these models are not explainable to domain experts, which is important in this sensitive application.

8 Conclusions and future directions

In this paper, we have proposed a PB-IRL approach to learn the underlying reward function for the automatic analysis of DTRs from their trials (demonstrations). The reward function explains the behavior of experts (physicians) more effectively than RL policies. We have also shown that an RL-based decision support system can be directly and efficiently trained by adopting this reward function.

The reward function for the demonstrations is estimated by learning appropriate policies. Such policies can easily be found by mixing all the estimated policies according to their mixture weights (as explained in Section 4). In a large state space, this is an easy and rapid way to obtain an optimal policy and hence the true reward function for the given demonstrations. In order to validate the approach, we generated some demonstrations, applied the proposed algorithm, and estimated the reward function for these demonstrations. To check how good these estimated rewards are, we ran the Value Iteration RL algorithm and learned the corresponding policies. By comparing these learned policies with the demonstrated policies, we have shown the accuracy of the algorithm. The results show that the proposed algorithm provides better dynamic treatment regimes and helps existing models to learn the environment faster.

The proposed IRL technique mitigates the problems associated with RL, but there are some limitations to be aware of:

  1. The proposed technique can only be applied to problems where expert trajectories/demonstrations are available, e.g., autonomous car driving (where we have the trajectories of taxi drivers) or DTRs (where we have the doctors’ policies).

  2. IRL algorithms often make the strong assumption that any observed behavior is optimal, which can be challenging in a real-time environment.

IRL methods only consider the positive demonstrations or trajectories (i.e., those that result in the recovery or survival of the patients) and learn a policy to recover these trajectories. The information embedded in negative trajectories (e.g., those that result in unsuccessful treatments) has been largely ignored; however, it could potentially help the learned policy to avoid repeating mistakes. Both positive and negative trajectories can be used to deduce best practices and avoid errors in dynamic treatment regimes. Such techniques may therefore increase the likelihood of patient survival by exploiting information from both positive and negative trajectories, providing better dynamic treatment regimes.