1 Introduction

A dynamic treatment regime is a decision rule, or set of decision rules, which determines how a treatment should be assigned to a patient over time. Typically a patient is observed at regular intervals, and at each visit a treatment decision or action A is made in response to measurements of state S taken at that visit together with the history of previous decisions and measurements. An optimal dynamic treatment regime is one which maximises an overall outcome Y measured at the end of a sequence of visits.

Since the seminal work of Murphy [15] there has been growing interest in biostatistical applications of decision rule methodology. Recent work includes Arjas and Saarela [1], Dawid and Didelez [6], Moodie et al. [13], Zhao et al. [21, 22]. The focus of most work has been on testing for treatment effects, typically for binary A and with rather few measurement times. Even in very simple circumstances there can be severe statistical challenges in this area (Chakraborty et al. [3]; Hernán et al. [10]; Moodie and Richardson [14]; Zhang et al. [20]).

Motivated by an application on anticoagulation, we suppose the treatment decision A is essentially continuous rather than categorical, and our interest is in estimation of optimal decisions rather than testing. We concentrate on the regret functions proposed by Murphy [15], which are defined in Sect. 2 and form a particular case of the so-called advantage learning class of approaches. A variety of methods have been proposed for estimation from observational or trial data (e.g. Moodie et al. [12]; Almirall et al. [2]; Henderson et al. [8]; Zhang et al. [20]; Zhao et al. [21, 22]). Some of these rely on knowledge of, or assumptions about, the process by which decisions on treatment A are reached, which is straightforward for a randomised trial; others rely on modelling the evolution of the states S as time proceeds. A particular case of the former is the g-estimation procedure proposed by Robins [17], and beautifully summarised by Moodie et al. [12]. A special case of the latter is the so-called regret-regression approach that was proposed independently by Almirall et al. [2] and Henderson et al. [8]. These methods all formulate the problem in terms of the structural nested mean models (SNMMs) described by Robins [17]. An alternative approach based on marginal structural models has been proposed by Orellana et al. [16], which allows the estimation of simple dynamic treatment rules. For example, the decision when to start a treatment may be based on state measurements progressing beyond a threshold, which must be determined. We will focus on the SNMM approaches in this paper.

An estimation method is doubly robust if it gives consistent parameter estimates whenever either the state mechanism S or the action process A has been modelled correctly. The g-estimation method is founded, as stated, on knowledge of the decision or action process A. If there is also assumed knowledge of the state S mechanism then a doubly robust form can be constructed (Robins [17]). It is of interest therefore to ask whether a doubly robust form of the regret-regression approach can be found. In Sect. 3 below we propose such a modification and we show how it is closely linked to doubly robust g-estimation. In Sect. 4 we use simulation to compare performance of various methods in terms of efficiency and robustness, and in Sect. 5 we illustrate use in treatment of patients on long term anticoagulation therapy.

2 Modelling Dynamic Treatment Regimes

We assume that we have data from n independent individuals, each observed according to the same visit schedule consisting of K visits. At visit j, measurements are taken which define the current state \(S_j\) of the patient and a treatment decision \(A_j\) is made. After K visits an outcome Y is measured. Our aim is to use the observed data to determine the optimal dynamic treatment regime to maximise the outcome Y. As an illustration we will use data from a study investigating patients taking the anticoagulation treatment warfarin to avoid abnormal blood clotting. Here measurements of blood-clotting potential are taken at each visit, defining \(S_j\), and a dose of warfarin is prescribed, defining the action \(A_j\). The final outcome Y is the time spent with blood-clotting time within a target range over the entire course of follow-up.

Taking a potential outcomes (or counterfactual) approach (see for example Greenland et al. [7]), let \(\mathcal{A}_j\) be the set of all possible actions that could be taken at visit j, and let \(\bar{\mathcal{A}}_j\) be the set of all possible treatment regimes \(\bar{a}_j=(a_1,\ldots,a_j)\) up to visit j. For \(\bar{a}_{j-1}\in\bar{\mathcal{A}}_{j-1}\), \(\bar{S}_{j}(\bar{a}_{j-1}) = (S_{1}, S_{2}(a_{1}),\ldots ,S_{j}(\bar{a}_{j-1}))\) denotes the potential state history under the treatment regime \(\bar{a}_{j-1}\). Similarly, \(Y(\bar{a}_{K})\) denotes the potential outcome under the treatment regime \(\bar{a}_{K}\in\bar{\mathcal{A}}_K\).

We make the consistency assumption that the observed state history \(\bar{S}_{K} = (S_{1}, \ldots, S_{K})\) is equal to the potential state history \(\bar{S}(\bar{a}_{K-1})\) under the observed treatment regime \(\bar{a}_{K} = \bar{A}_{K}=(A_{1},\ldots,A_{K})\), and that the observed outcome Y is equal to the potential outcome \(Y(\bar{a}_{K})\) under the observed treatment regime \(\bar{a}_{K} = \bar{A}_{K}\). In short, this means that the method by which treatments are assigned does not affect the values of the future states or the outcome (see Cole and Frangakis [5] for a thorough discussion of the consistency assumption). Throughout this paper we will therefore replace potential outcomes notation, e.g. \(E(Y|\bar{S}(\bar{a}_{K-1}),\bar{a}_{K})\) for the expected value of the potential outcome \(Y(\bar{a}_{K})\) conditional on the treatment regime \(\bar{a}_{K}\) and potential state history \(\bar{S}(\bar{a}_{K-1})\), with the observed outcomes notation \(E(Y|\bar{S}_{K},\bar{A}_{K})\).

We also make the assumption of no unmeasured confounders, which means that the choice of treatment to be received does not depend on potential future states or the potential outcome except through observed state and treatment history. When no drop-out occurs this assumption is equivalent to exchangeability. It enables us to estimate causal effects from observational data (see Hernán and Robins [11], for a discussion of the exchangeability assumption). We make a third assumption of positivity, that the optimal treatment regime has a positive probability of being observed in the data or, in the case of a continuous treatment, that it is identifiable from the observed data (see Cole and Hernán [4], for a discussion of positivity and Henderson et al. [8], for the extension in the continuous case). All three assumptions are standard in causal inference.

Let \(\bar{S}_{j}=(S_{1},\ldots,S_{j})\) be the observed measurement history up to and including visit j, and \(\bar{A}_{j}=(A_{1},\ldots,A_{j})\) be the history of actions taken up to visit j. A dynamic treatment regime d is defined by a set of decision rules, \(d = (d_{1}(S_{1}), \ldots, d_{j}(\bar{S}_{j}, \bar{A}_{j-1}), \ldots, d_{K}(\bar{S}_{K},\bar {A}_{K-1}))\), which prescribe an action to be taken at each visit given all information available at the time of the visit, including the current state S j . The optimal dynamic treatment regime \(d^{\operatorname{opt}}\) is the one which optimises the expected value of the outcome Y.

A naive approach to modelling the outcome would be to regress Y on state history \(\bar{S}_{K}\) and action history \(\bar{A}_{K}\). However, this ignores the potential effect of previous actions \(\bar{A}_{j-1}\) and states \(\bar{S}_{j-1}\) on the current state \(S_{j}\). Including the state \(S_j\) in the analysis may introduce bias because action history \(\bar{A}_{j-1}\) and state history \(\bar{S}_{j-1}\) may influence both the current state \(S_j\) and the outcome Y.

This problem can be solved by modelling quantities which isolate the causal effect of treatment \(A_j\) on Y (see Hernán [9] for a discussion of causal effects). Murphy [15] proposed the use of regret functions, which measure the expected decrease in Y due to an action \(a_j\) taken at time j compared to the optimal action, given that optimal actions are used in the future. The regret at time j is defined by

$$\begin{aligned} \mu_j(a_j | \bar{S_j}, \bar{A}_{j-1}) &= E\bigl(Y\bigl(a_1,\ldots ,a_{j-1},d^{\operatorname{opt}}_j,\ldots,d^{\operatorname{opt}}_K \bigr)\big|\bar{S}_j,\bar{a}_{j-1}=\bar {A}_{j-1}\bigr) \\ &\quad - E\bigl(Y\bigl(a_1,\ldots,a_j,d^{\operatorname{opt}}_{j+1}, \ldots,d^{\operatorname{opt}}_K\bigr)\big|\bar{S}_j,\bar {a}_{j-1}=\bar{A}_{j-1}\bigr). \end{aligned}$$
(1)

As an alternative, Robins [17] suggested using a blip function which compares actions to a reference action \(a_0\). The blip measures the expected change in Y when action \(a_j\) is taken at time j compared to \(a_0\), assuming future actions are \(a_0\),

$$\begin{aligned} \gamma_j(a_j | \bar{S_j}, \bar{A}_{j-1}) &= E\bigl(Y\bigl(a_1,\ldots ,a_{j-1},d^{0}_j,\ldots,d^{0}_K \bigr)\big|\bar{S}_j,\bar{a}_{j-1}=\bar{A}_{j-1}\bigr) \\ &\quad - E\bigl(Y\bigl(a_1,\ldots,a_j,d^{0}_{j+1}, \ldots,d^{0}_K\bigr)\big|\bar{S}_j,\bar {a}_{j-1}=\bar{A}_{j-1}\bigr), \end{aligned}$$
(2)

where the reference regime \(d^{0}\) specifies that all actions are set to \(a_0\).

It has been argued by Robins [17] that correct models can be specified more easily for blip functions because a comparison to a reference regime can be envisaged more readily by clinicians than a comparison to an unspecified optimal regime. However, determining the optimal regime from models for the blip functions can be computationally challenging, whereas the optimal action \(a_{j}^{\operatorname{opt}}\) follows immediately from the form of the regret function, because by construction \(\mu_{j}(a_{j}^{\operatorname{opt}}|\bar{S}_{j}, \bar{A}_{j-1})=0\). This also enables us to restrict our attention to decision rules with simple forms (see also Rosthøj et al. [19]). For these reasons we will use regret functions in the rest of this paper.
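For example, anticipating the quadratic form used in Sect. 4, if

$$\mu_j(a_j | \bar{S}_j, \bar{A}_{j-1}) = \psi_1 (a_j - \psi_2 S_j)^2, \quad \psi_1 > 0, $$

then the regret is non-negative and vanishes only at \(a_j = \psi_2 S_j\), so the optimal rule \(a_j^{\operatorname{opt}} = \psi_2 S_j\) can be read off without any further optimisation.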

3 Estimating Optimal Dynamic Treatment Regimes

Two methods which can be used to estimate the optimal dynamic treatment regime are g-estimation (Robins [17], see also Moodie et al. [12]) and regret-regression, which was proposed independently by Henderson et al. [8] and Almirall et al. [2].

3.1 G-estimation

In order to estimate an optimal dynamic treatment regime using g-estimation, we must first specify models for the regret functions \(\mu_{j}(a_{j} | \bar{S}_{j}, \bar{A}_{j-1})\). The form of the regret functions determines the form of the optimal treatment rules. Hereafter, when we refer to an optimal decision rule we therefore mean the decision rule of the specified form which optimises the expected outcome. Models for \(\mu_j\) may depend on parameters ψ, which may be shared across time-points. See Moodie et al. [12] and Moodie et al. [13] for examples of models with different parameters at different time-points. We then define

$$H_j = H_j(\psi) = Y + \sum _{k\geq j} \mu_k(A_k | \bar{A}_{k-1},\bar {S}_k;\psi), $$

which provides an estimate of the expected outcome in the counterfactual event that optimal decisions are followed from time j onwards (Robins [17]; Moodie et al. [12]). For conciseness we shorten \(\mu_{j}(A_{j} | \bar{A}_{j-1},\bar{S}_{j};\psi)\) to \(\mu_j\) for the remainder of this paper.

We also specify models for the probability density \(f(a_{j}|\bar{S}_{j},\bar {A}_{j-1})\) for the assigned value of the action A j , conditional on state and action history, and for \(E(H_{j}|\bar{S}_{j},\bar{A}_{j-1})\). We can then form the g-estimation equations

$$\begin{aligned} EE^{\mathit{GE}} (\psi) & = \sum_{j=1}^K \bigl( H_{j} - E(H_{j}|\bar{S}_{j},\bar {A}_{j-1}) \bigr) \bigl( g_j(A_{j}| \bar{S}_j,\bar{A}_{j-1}) \\ &\quad - E_{A_{j}} \bigl(g_j(A_{j}| \bar{S}_j,\bar{A}_{j-1}) \bigr) \bigr) \end{aligned}$$
(3)

for some functions \(g_{j}(A_{j}|\bar{S}_{j},\bar{A}_{j-1})\) of the same dimension as ψ. It has been shown that solutions \(\hat{\psi}^{\mathit{GE}}\) to \(E(EE^{\mathit{GE}}(\psi))=0\) provide consistent estimates of ψ if the regret functions are correctly modelled and either the model specified for \(f(a_{j}|\bar{S}_{j},\bar{A}_{j-1})\) or the model specified for \(E(H_{j}|\bar{S}_{j},\bar{A}_{j-1})\) is correct (Robins [17]). We give a simpler proof in Appendix A.1. G-estimation is therefore doubly robust in the sense discussed in the Introduction.

A simple choice for the functions \(g_j(A_j)\) is (Moodie et al. [12]):

$$g^{\operatorname{simp}}_j(A_j | \bar{S}_j, \bar{A}_{j-1}) = E \biggl( \frac{\partial\mu _j}{\partial\psi} \bigg| \bar{S}_j, \bar{A}_j \biggr), $$

which can be calculated easily from the \(\mu_j(\psi)\). The alternative

$$\begin{aligned} g^{\operatorname{eff}}_j(A_j | \bar{S}_j, \bar{A}_{j-1}) = & E \biggl( \frac{\partial H_j}{\partial\psi} \bigg| \bar{S}_j, \bar{A}_j \biggr) \\ = & E \biggl( \sum_{k\geq j} \frac{\partial\mu_k}{\partial\psi} \bigg| \bar{S}_j, \bar{A}_j \biggr) \end{aligned}$$
(4)

gives Robins’ [17] locally efficient semiparametric estimator of ψ. While \(g^{\operatorname{eff}}\) has been shown to be more efficient than \(g^{\operatorname{simp}}\) (Robins [17]), it can be more complicated to calculate because it requires expected values of \(\mu_k\) conditional on \((\bar{S}_{j},\bar{A}_{j})\) for \(k>j\). In turn these require conditional expectations of (functions of) all \(S_k\) and \(A_k\) for \(k>j\), and hence detailed knowledge of both state and action evolution processes.
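As a concrete illustration, the following minimal sketch (our own Python, not code from the paper) assembles the \(H_j\) and the estimating function (3) with \(g_j = g_j^{\operatorname{simp}}\) for the quadratic regret used later in Sect. 4. The conditional-mean model EH and the action-model expectation Eg are hypothetical user-supplied callables; in practice they come from the fitted state and action models.

```python
import numpy as np
from scipy.optimize import root

def regret(psi, S, A):
    """Quadratic regrets mu_j = psi1 * (A_j - psi2 * S_j)^2; S, A are (n, K)."""
    psi1, psi2 = psi
    return psi1 * (A - psi2 * S) ** 2

def g_simp(psi, S, A):
    """g^simp: derivative of mu_j with respect to (psi1, psi2), shape (n, K, 2)."""
    psi1, psi2 = psi
    d1 = (A - psi2 * S) ** 2
    d2 = -2.0 * psi1 * S * (A - psi2 * S)
    return np.stack([d1, d2], axis=-1)

def ee_ge(psi, S, A, Y, EH, Eg):
    """Estimating function (3): sum over j of (H_j - E[H_j|.]) (g_j - E[g_j|.])."""
    mu = regret(psi, S, A)
    # H_j = Y + sum_{k >= j} mu_k, via a reversed cumulative sum over visits
    H = Y[:, None] + np.cumsum(mu[:, ::-1], axis=1)[:, ::-1]
    rH = H - EH(psi, S, A)                     # (n, K): model for E(H_j | history)
    rg = g_simp(psi, S, A) - Eg(psi, S, A)     # (n, K, 2): E over A_j given history
    return (rH[..., None] * rg).sum(axis=(0, 1))

# psi_hat = root(ee_ge, x0=np.ones(2), args=(S, A, Y, EH, Eg)).x
```

The commented line shows how \(\hat{\psi}^{\mathit{GE}}\) would be obtained by solving the summed equations numerically.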

3.2 Regret-regression

Murphy [15] showed that \(E(Y|\bar{S}_{K}, \bar{A}_{K})\) can be decomposed into a sum of regret functions μ j and nuisance functions \(\phi _{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\) as follows:

$$ E(Y|\bar{S}_K,\bar{A}_{K}) = \beta_0 + \sum _{j=1}^K \phi_j (S_j|\bar {S}_{j-1}, \bar{A}_{j-1}) - \sum _{j=1}^K\mu_j( A_j | \bar{S}_j,\bar{A}_{j-1}). $$
(5)

The nuisance function \(\phi_{j} (S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\) for j≥2 is defined to be

$$\begin{aligned} \phi_j (S_j|\bar{S}_{j-1}, \bar{A}_{j-1}) &= E\bigl(Y\bigl(a_1,\ldots ,a_{j-1},d^{\operatorname{opt}}_j,\ldots,d^{\operatorname{opt}}_K \bigr)\big|\bar{S}_j,\bar{A}_{j-1}\bigr) \\ &\quad - E\bigl(Y\bigl(a_1,\ldots,a_{j-1},d^{\operatorname{opt}}_j, \ldots,d^{\operatorname{opt}}_K\bigr)\big|\bar{S}_{j-1}, \bar{A}_{j-1}\bigr), \end{aligned}$$
(6)

with \(\phi_{1}(S_{1}) = E(Y(d^{\operatorname{opt}}_{1},\ldots,d^{\operatorname{opt}}_{K})|S_{1}) - E(Y(d^{\operatorname{opt}}_{1},\ldots,d^{\operatorname{opt}}_{K}))\). The function \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\) expresses the change in the expected value of Y due to the measurement of \(S_j\) when optimal decision rules are used in the future. Note that \(E_{S_{j}}(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})) = 0\) follows from the definition of \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\). Note also that the decomposition (5) requires the nuisance and regret functions to be defined as differences of expectations under the assumption that optimal policies are followed at future time-points. There is no similar decomposition with non-negative \(\{\mu_j\}\) based on a comparison with non-optimal policies, such as the blip functions suggested by Robins [17] (see Appendix B).

The decomposition (5) can be used to estimate regret parameters ψ if models are specified for the \(\phi_{j}(S_{j}|\bar {S}_{j-1}, \bar{A}_{j-1})\) (Henderson et al. [8]; Almirall et al. [2]). To satisfy the condition \(E_{S_{j}}(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar {A}_{j-1})) = 0\), Henderson et al. [8] suggested the form

$$\phi_j (S_j|\bar{S}_{j-1}, \bar{A}_{j-1}) = \beta_{j}^T(\bar {S}_{j-1},\bar{A}_{j-1}) \bigl(S_{j} - E(S_j|\bar{S}_{j-1},\bar{A}_{j-1})\bigr), $$

where \(\beta_{j}^{T}(\bar{S}_{j-1},\bar{A}_{j-1})\) is a coefficient which may depend on the state and action history before time j. Under this approach a model must be specified for \(E(S_{j}|\bar{S}_{j-1},\bar{A}_{j-1})\). Parameters can be estimated using least squares, which is equivalent to solving \(E(EE^{\mathit{RR}}(\psi))=0\), where \(EE^{\mathit{RR}}(\psi)\) are the regret-regression estimating equations

$$ EE^{\mathit{RR}}(\psi) = \bigl(Y - E(Y|\bar{S}_{K}, \bar{A}_{K})\bigr) \sum_j \frac{\partial \mu_{j}}{\partial\psi}. $$
(8)

A proof is given in Appendix A.2 that the resulting estimates \(\hat{\psi}^{\mathit{RR}}\) are consistent estimates for ψ provided the regret functions \(\mu_{j}(a_{j} | \bar{S_{j}}, \bar {A}_{j-1})\) and the nuisance functions \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar {A}_{j-1})\) have been modelled correctly.
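To fix ideas, here is a minimal least-squares sketch (our own Python) of regret-regression under the decomposition (5), using the quadratic regret from Sect. 4 and, as an illustrative simplification, scalar coefficients \(\beta_j\) in the nuisance form above. ES is a hypothetical callable returning the fitted first-stage values \(E(S_j|\bar{S}_{j-1},\bar{A}_{j-1})\).

```python
import numpy as np
from scipy.optimize import least_squares

def regret(psi, S, A):
    """Quadratic regrets mu_j = psi1 * (A_j - psi2 * S_j)^2; S, A are (n, K)."""
    psi1, psi2 = psi
    return psi1 * (A - psi2 * S) ** 2

def rr_residuals(theta, S, A, Y, ES):
    """Residuals Y - E(Y | S-bar_K, A-bar_K) from the decomposition (5)."""
    n, K = S.shape
    beta0, beta, psi = theta[0], theta[1:K + 1], theta[K + 1:]
    phi = beta * (S - ES(S, A))                # phi_j = beta_j (S_j - E(S_j | history))
    fitted = beta0 + phi.sum(axis=1) - regret(psi, S, A).sum(axis=1)
    return Y - fitted

# K = S.shape[1]
# fit = least_squares(rr_residuals, x0=np.ones(K + 3), args=(S, A, Y, ES))
# psi_hat = fit.x[-2:]                         # last two entries are (psi1, psi2)
```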

A natural question to ask is whether we can formulate a doubly robust version of regret-regression, which is robust to misspecification of either \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\) or the probability density \(f(a_{j}|\bar{S}_{j},\bar{A}_{j-1})\) of assigning action A j . A naive extension of the estimating equations (8) would be

$$ EE^{\operatorname{naive}}(\psi) = \bigl(Y - E(Y|\bar{S}_{K}, \bar{A}_{K})\bigr) \sum_j \biggl( \frac {\partial\mu_{j}}{\partial\psi} - E_{A_j} \biggl(\frac{\partial\mu_{j}}{\partial\psi} \biggr) \biggr). $$
(9)

However, the resulting estimates \(\hat{\psi}^{\operatorname{naive}}\) are not consistent if the \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\) are misspecified, because when we take the expectation over Y the first factor of (9) retains some dependence on \(A_j\) for \(j=1,\ldots,K-1\) (see Appendix A). Instead, we obtain consistent estimates with the double-robustness property if we replace the sum in (9) with the contribution just from the final term:

$$ EE^{\mathit{DRRR}}(\psi) = \bigl(Y - E(Y|\bar{S}_{K}, \bar{A}_{K})\bigr) \biggl(\frac {\partial\mu_{K}}{\partial\psi} - E_{A_K} \biggl( \frac{\partial\mu_{K}}{\partial\psi} \biggr) \biggr). $$
(10)

The estimators \(\hat{\psi}^{\mathit{DRRR}}\) derived from (10) will be consistent because \(E(EE^{\mathit{DRRR}})=0\) whenever either the nuisance functions or the action model is correctly specified.

Note that, by the decomposition (5) and the definition of \(H_K\),

$$Y - E(Y|\bar{S}_K,\bar{A}_K) = H_K(\psi) - E\bigl(H_K(\psi)|\bar{S}_K,\bar{A}_{K-1}\bigr). $$
So the doubly robust regret-regression estimating equations (10) are identical to the final (j=K) term of the g-estimating equations (3) with \(g_{j} = g^{\operatorname{simp}}_{j}\) when \(E(Y|\bar{S}_{K},\bar{A}_{K})\) is modelled in the same way. Specification of \(E(H_{j}(\psi)|\bar {S}_{j},\bar{A}_{j-1})\) is equivalent to specification of the nuisance functions \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\) for regret-regression because

$$E\bigl(H_j(\psi)|\bar{S}_j,\bar{A}_{j-1} \bigr) = \beta_0+ \sum_{k=1}^{j} \phi_k (S_k|\bar{S}_{k-1}, \bar{A}_{k-1}) - \sum_{k=1}^{j-1} \mu_k( A_k | \bar{S}_k,\bar{A}_{k-1};\psi), $$

(see Appendix A.1). It may be difficult to identify an appropriate model for either \(E(H_{j}(\psi)|\bar{S}_{j},\bar{A}_{j-1})\) or \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\), and the choice of which to specify is likely to depend on the context. See Henderson et al. [8] and Rosthøj et al. [19] for further discussion of modelling \(\phi_{j}(S_{j}|\bar{S}_{j-1}, \bar{A}_{j-1})\). We recommend taking the models to be as general as possible; see Sect. 5 for an example. Since these models are not of direct interest, it is safer to err on the side of overfitting (Henderson et al. [8]). We will show via simulation studies in Sect. 4 that restricting to the final term in this way results in a loss of precision for \(\hat{\psi}^{\mathit{DRRR}}\) compared to \(\hat{\psi}^{\mathit{GE}}\).
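In code, the DRRR estimating function is just the \(j=K\) term of the g-estimation sketch in Sect. 3.1, with the outcome residual in place of \(H_K - E(H_K|\cdot)\). This fragment reuses regret and g_simp from that sketch; EY and Eg are again hypothetical user-supplied model callables.

```python
def ee_drrr(psi, S, A, Y, EY, Eg):
    """Estimating function (10): final-term g-estimation with Y - E(Y|.) residuals."""
    gK = g_simp(psi, S, A)[:, -1, :]            # d mu_K / d psi, shape (n, 2)
    resid = Y - EY(psi, S, A)                   # outcome residuals, shape (n,)
    return (resid[:, None] * (gK - Eg(psi, S, A)[:, -1, :])).sum(axis=0)
```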

4 Simulation

We demonstrate the behaviour of \(\hat{\psi}^{\mathit{GE}}\) with \(g_{j}=g_{j}^{\operatorname{simp}}\), \(\hat{\psi}^{\mathit{GE}}\) with \(g_{j}=g_{j}^{\operatorname{eff}}\), \(\hat{\psi}^{\mathit{RR}}\) and \(\hat{\psi}^{\mathit{DRRR}}\) using a simulation study. We generated data from 1000 patients, followed up over 5 time-points. States were normally distributed with \(E(S_1)=0.5\), \(E(S_j|A_{j-1})=0.5-A_{j-1}\) for \(j>1\) and residual variance \(\sigma^{2}_{s}=1\). Actions were generated as \(A_j \sim U(1.25,3)\) when \(S_1>0.5\) and \(A_j \sim U(0,1.75)\) when \(S_1\leq 0.5\). By definition \(\mu_j\) is non-negative, so regret functions were taken to be quadratic with \(\mu_{j}(a_{j}|\bar{S}_{j},\bar{A}_{j-1}) = \psi_{1} (a_{j} - \psi_{2} S_{j})^{2}\), with \(\psi_1=6\) and \(\psi_2=2\). The optimal action at visit j, \(a_{j}^{\operatorname{opt}}\), is the action satisfying \(\mu_{j}(a_{j}^{\operatorname{opt}}|\bar{S}_{j},\bar{A}_{j-1})=0\), giving \(a_{j}^{\operatorname{opt}}= \psi_{2} S_{j}\). Note that the optimal action may be negative, even though the observed actions are always positive. In practice this would mean that estimated optimal actions had been extrapolated to a region of the action space that had not been observed in the data. Such an extrapolation would only be appropriate if the regret functions had been modelled correctly. Outcomes Y were normally distributed with

$$E(Y|\bar{S}_K,\bar{A}_K) = 30 - 5 \bigl(S_1 - E(S_1)\bigr) - \sum_{j=2}^5 (5 + 2 A_{j-1}) \bigl(S_j - E(S_j|A_{j-1}) \bigr) - \sum_{j=1}^{5} \mu_j $$

and variance \(\sigma_{y}^{2}=1\).
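For reference, here is a sketch of this data-generating process in Python (the seed and code organisation are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 5
psi1, psi2 = 6.0, 2.0

S = np.zeros((n, K))
A = np.zeros((n, K))
S[:, 0] = 0.5 + rng.standard_normal(n)                  # E(S_1) = 0.5, sigma_s = 1
for j in range(K):
    if j > 0:
        S[:, j] = (0.5 - A[:, j - 1]) + rng.standard_normal(n)
    hi = S[:, 0] > 0.5                                  # action law depends on S_1
    A[hi, j] = rng.uniform(1.25, 3.0, hi.sum())
    A[~hi, j] = rng.uniform(0.0, 1.75, (~hi).sum())

mu = psi1 * (A - psi2 * S) ** 2                         # regrets mu_j
ES = np.column_stack([np.full(n, 0.5)] +
                     [0.5 - A[:, j - 1] for j in range(1, K)])
coef = np.column_stack([np.full(n, 5.0)] +
                       [5.0 + 2.0 * A[:, j - 1] for j in range(1, K)])
Y = 30.0 - (coef * (S - ES)).sum(axis=1) - mu.sum(axis=1) \
    + rng.standard_normal(n)                            # sigma_y = 1
```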

For both g-estimation and regret-regression, parameters were estimated using a two-stage process. In the first stage the model for the state distribution was fitted to the observed states and, if required, the model for assigning actions was fitted to the observed actions. For regret-regression these models were then used to estimate the residuals \(Y-E(Y|\bar{S}_{K},\bar{A}_{K})\) via the decomposition (5), and the parameters were estimated using least squares. For all other methods the models were used to determine the corresponding estimating equations, which were solved numerically. Standard errors were calculated using bootstrapping with 100 bootstrap samples.

Parameter estimates \(\hat{\psi}\) were obtained using correctly and incorrectly specified models for \(S_j\) and \(A_j\). The misspecified model for \(S_j\) assumed \(S_j \sim N(0.5,1)\), and so ignored the dependence of the states on the previous action. In the misspecified action model the actions were assumed to be uniformly distributed between 0 and 3.

Table 1 shows results for \(\psi_2\), which is the parameter of most interest since it determines the optimal dose. Results for other parameters are not reported, but lead to similar conclusions. Coverage probability is estimated by the proportion of simulations for which the estimated confidence interval contains the true parameter value. Parameter estimates were discarded when convergence was not achieved.

Table 1 Simulation results for \(\psi_2\) using regret-regression (RR), doubly robust regret-regression (DRRR), g-estimation with \(g=g^{\operatorname{simp}}\) (GE SIMP) and g-estimation with \(g=g^{\operatorname{eff}}\) (GE EFF). Reported are means of parameter estimates with standard deviation of parameter estimates in brackets, means of estimated standard errors, coverage probability, root-mean-square error and the number of simulated data sets for which convergence was achieved. Results are based on 1000 samples of size n=1000

When models for both \(S_j\) and \(A_j\) were specified correctly, parameter estimates were consistent using all estimation methods. The most efficient method was RR. For GE SIMP estimated standard errors tended to be too high, leading to over-coverage of confidence intervals.

When the model for \(S_j\) was misspecified, RR results were slightly biased, with none of the estimated confidence intervals containing the true parameter value \(\psi_2=2\). All other methods are robust to misspecification of the state model, and gave consistent parameter estimates. The GE EFF estimating equations for this scenario are identical to the GE SIMP estimating equations because the incorrect model for \(S_j\) has been used when calculating expressions for the \(g_{j}^{\operatorname{eff}}\): because the misspecified model for \(S_j\) is independent of \(A_{j-1}\), only the term involving \(\mu_j\) in (4) depends on \(A_j\), and all other terms therefore cancel when subtracting \(E_{A_{j}} (g_{j}(A_{j}|\bar{S}_{j},\bar{A}_{j-1}))\) from \(g_{j}(A_{j}|\bar{S}_{j},\bar{A}_{j-1})\). The DRRR method was less efficient than GE SIMP and GE EFF. For all the methods overestimation of standard errors gave over-coverage of confidence intervals.

When the model for \(A_j\) was misspecified, all methods gave consistent parameter estimates. For RR this is because the method does not depend on the model for \(A_j\); all other methods are robust to misspecification of the action model. Again, the most efficient method was RR.

When models for both \(S_j\) and \(A_j\) were misspecified, none of the methods would be expected to give consistent parameter estimates. Here all methods gave biased results, with DRRR parameter estimates being the most biased. RR had the smallest root-mean-square error, with similar bias but smaller standard errors compared to GE SIMP. Estimation under misspecified models was also less likely to converge, as indicated by the lower convergence rates.

The bias caused by misspecification of state and action models in our simulation study was smaller than might be expected from a previous simulation study (Almirall et al. [2]). This could be because we have focussed on a continuous treatment decision, whereas Almirall et al. considered only binary actions. Model misspecification in the Almirall et al. study was generated by multiplying estimated state values by random noise of varying amplitudes. In contrast, our simulation study aimed to explore model misspecifications that might occur in practice, such as omitting variables from the state and action models.

5 Example: Blood-Clotting

We illustrate the methods with data taken from 303 patients at risk of thrombosis who were receiving long-term anticoagulation therapy for abnormal blood-clotting. These data have been analysed previously by Rosthøj et al. [18] and by Henderson et al. [8]. The ability of the blood to clot was measured using the International Normalised Ratio (INR), with high values indicating that the blood clots too slowly, increasing the risk of haemorrhage, and low values indicating fast clotting-times with an increased risk of thrombosis. Each patient attended 14 clinic visits at which their INR was measured and their dose of anticoagulant was adjusted accordingly. The aim of therapy is to maintain a patient’s INR within a target range, which is pre-specified for each patient.

As an outcome for analysis we used the proportion of time over follow-up that was spent with the INR within target range. The final dose adjustment did not contribute to the outcome, and we treated the first four clinic visits as a stabilisation period, giving K=9. States \(S_j\) are defined to be the standardised difference between the INR at the jth visit and the target range. Actions \(A_j\) are defined to be the change in anticoagulant dose at the jth visit. With these definitions \(S_j=0\) for 50 % of state observations and \(A_j=0\) for 60 % of actions taken.

We modelled the regrets as quadratic functions, depending on the current and previous states and the previous action:

$$\mu_j (a_j | \bar{S}_j, \bar{A}_{j-1}; \psi) = \psi_{1} (a_j - \psi_{2}S_j - \psi_{3}S_{j-1} - \psi_4 A_{j-1})^2. $$
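Because the regret vanishes at the optimal action, the fitted decision rule is linear in the recent history:

$$a_j^{\operatorname{opt}} = \psi_{2}S_j + \psi_{3}S_{j-1} + \psi_4 A_{j-1}, $$

which is the form used in the worked dose recommendations below.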

To model the states we used a mixture model with logistic and normal components to account for the high number of zero states. Linear predictors for both models were allowed to depend on the previous four states and actions, as well as a number of interactions between them. The model for the actions was defined in the same way.
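One way such a two-part model could be fitted is sketched below (our own Python, using statsmodels; the design matrix X of history terms is hypothetical and the paper's exact specification may differ). A logistic component models the point mass at zero and a normal linear component models the non-zero states.

```python
import numpy as np
import statsmodels.api as sm

def fit_state_model(X, s):
    """X: (n, p) history covariates; s: (n,) states with many exact zeros."""
    is_zero = (s == 0).astype(float)
    logit = sm.Logit(is_zero, sm.add_constant(X)).fit(disp=0)       # P(S_j = 0 | history)
    nonzero = s != 0
    normal = sm.OLS(s[nonzero], sm.add_constant(X[nonzero])).fit()  # E(S_j | S_j != 0)
    return logit, normal
```

The model for the actions could be fitted in the same way.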

Parameters were estimated using RR, DRRR and GE SIMP, with standard errors obtained by bootstrap with 1000 resamplings. We were unable to implement the more efficient method GE EFF because of the extra complexity introduced by the dependence of the regret functions on the previous state and the previous action. In this case no terms in \(g_{j}(A_{j}) - E_{A_{j}} (g_{j}(A_{j}) | \bar{S}_{j}, \bar{A}_{j-1})\) automatically cancelled, as was the case for the simulation study. So, for example, it would be necessary to calculate \(E(\partial\mu_9/\partial\psi|S_1,A_1)\) by integrating out all other \(S_j\) and \(A_j\). In this complicated scenario we found such calculations to be algebraically intractable.

Results are given in Table 2. Parameter estimates from RR, DRRR and GE SIMP are similar, although the RR results tend to favour slightly more extreme changes of dose than the GE SIMP results. The difference between RR and GE SIMP results could indicate some model misspecification, but standard errors are too large to draw any firm conclusions. The DRRR standard errors were substantially larger than the GE SIMP standard errors. We therefore place most confidence in the GE SIMP parameter estimates, because GE SIMP is the most efficient estimation method with the double-robustness property. Some bootstrap samples (3 out of 1000) did not converge using RR, and for others there was a tendency for \(\psi_1\) to be estimated close to 0. This could explain the larger standard errors estimated for RR compared to GE SIMP.

Table 2 Results for the blood-clotting example using regret-regression (RR), doubly robust regret-regression (DRRR) and g-estimation with \(g=g^{\operatorname{simp}}\) (GE SIMP). Reported are estimated parameter values with standard errors in brackets

The estimates for \(\psi_2\) indicate that the dose should be increased if the current state is too low and should be decreased if it is too high, as would be expected. Negative values of \(\psi_3\) indicate that if the previous state was below range then the current dose should be adjusted upwards, and if it was above range then the current dose should be adjusted down. Similarly, estimates for \(\psi_4\) indicate that if the previous dose was increased then the current dose should be reduced, and vice versa. So, for example, a patient whose current INR measurement is \(S_j=0.5\), and who previously also had high INR, \(S_{j-1}=0.5\), and whose dose was reduced, \(A_{j-1}=-0.5\), would be recommended to reduce their dose by 1.44 according to the GE SIMP estimates. By comparison, a patient who also had \(S_j=0.5\), but whose INR was previously too low, \(S_{j-1}=-0.5\), resulting in an increase of dose \(A_{j-1}=0.5\), would be recommended to reduce their dose by a smaller amount of 0.80.

In summary, all three methods give plausible parameter estimates, but RR standard errors seem large in comparison with GE SIMP standard errors. The simulation results suggest that standard errors estimated using GE SIMP could also be overly conservative.

6 Discussion

We have demonstrated that two methods which have been proposed for estimating optimal dynamic treatment regimes, regret-regression and g-estimation, are closely related. Formulating a doubly robust version of regret-regression led to a truncated version of the g-estimation equations.

The regret-regression approach is efficient when the model for states \(S_j\) is correctly specified. No model for actions \(A_j\) is required. G-estimation, on the other hand, can be applied when the action model is known, without the need to model states correctly. This is perhaps the best approach for trial data, where actions are randomised and hence fully understood. For observational data it may be the case that the natural process of state evolution is easier to model than the subjective actions chosen by health personnel. G-estimation is doubly robust in the sense that parameter estimates are consistent provided that either the states or the actions are modelled correctly. An assumption of no unmeasured confounders is necessary for inference in both cases.

Regret-regression outperforms efficient g-estimation even when the latter makes use of correct specification of both action and state models. However, it performs poorly when the state model is misspecified, whereas efficient g-estimation is robust. Given that the states are fully observed, one can argue that careful attention to modelling and diagnostics should reduce or remove the risk of major misspecification. Nonetheless our recommendation is to attempt efficient g-estimation whenever possible. Unfortunately, as in the blood-clotting application, when the regret and state models are fairly complex it can be difficult or in practice impossible to obtain the functions \(H_j(\psi)\) defined in Sect. 3.1 that are required for implementation.

Biases resulting from model misspecification were smaller than might have been expected from a previous simulation study (Almirall et al. [2]). One difference here is that we have focussed on continuous rather than binary treatment decisions. It would be interesting to see if such small biases persist for other forms of regret functions and more complicated models. We have assumed throughout that regret functions have been specified correctly. We leave investigation of the effects of regret misspecification for future work.