Dynamic Treatment Regimes Using Bayesian Additive Regression Trees for Censored Outcomes

To provide the best possible care, physicians need to customize treatments for individuals who share the same health state, especially when treating diseases that can progress and require additional treatment, such as cancer. Making decisions at multiple stages as a disease progresses can be formalized as a dynamic treatment regime (DTR). Most existing optimization approaches for estimating dynamic treatment regimes, including the popular method of Q-learning, were developed in a frequentist context. Recently, a general Bayesian machine learning framework that facilitates using Bayesian regression modeling to optimize DTRs has been proposed. In this article, we adapt this approach to censored outcomes using Bayesian additive regression trees (BART) for each stage under the accelerated failure time modeling framework, and we present simulation studies and a real data example that compare the proposed approach with Q-learning. We also develop an R wrapper function that utilizes a standard BART survival model to optimize DTRs for censored outcomes. The wrapper function can easily be extended to accommodate any type of Bayesian machine learning model.


Introduction
Optimizing medical therapy often requires that the treatment be individually tailored to the patient initially, and that the treatment adapt to the patient's changing characteristics over time. Since patient responses can often be heterogeneous, it is challenging for physicians to customize treatments based on traditional clinical trial results, which lack the ability to identify subgroups with different treatment effects and rarely consider successions of treatments. For chronic diseases that can evolve, it is even more important, and more difficult, to choose the best therapy in sequence. To give a simple example, oncologists typically choose an initial immunosuppressant regime for patients with acute myeloid leukemia (AML) who are undergoing allogeneic hematopoietic cell transplantation (AHCT), to prevent a serious potential complication called graft-versus-host disease (GVHD). When such an initial regime fails, a salvage treatment is chosen based on the patient's prior treatments and responses. Such a multi-stage treatment decision has been formalized as a dynamic treatment regime (DTR) by Murphy (2003). Each decision rule in a DTR takes a patient's individual characteristics, treatment history, and possible intermediate outcomes observed up to a certain stage as inputs, and outputs a recommended treatment for that stage.
A number of approaches have been proposed for estimating and optimizing DTRs, including those by Robins (2004), Moodie et al. (2007), Qian and Murphy (2011), Zhao et al. (2015), Krakow et al. (2017), Murray et al. (2018), and Simoneau et al. (2020). Among these, the Bayesian machine learning (BML) method developed by Murray et al. (2018) innovatively bridges the gap between Bayesian inference and dynamic programming methods from machine learning. A key advantage of a Bayesian approach to estimation is the quantification of uncertainty in decision making through the resulting posterior distribution. A second benefit, specific to the BML approach, is its highly flexible estimation, which minimizes the risk of estimation errors due to model mis-specification.
However, the BML method has not yet been adapted to censored outcomes, which are among the most common types of outcomes in controlling chronic diseases. Motivated by the study of optimal therapeutic choices to prevent and treat GVHD, in this paper we extend this approach to censored outcomes under the accelerated failure time (AFT) model framework. By modifying the data augmentation step in the BML method, the censored observations can be imputed in an informative way so that the observed censoring time is well utilized. This extension is illustrated using Bayesian additive regression trees (BART). We also implemented the proposed AFT-BML approach by developing an R function that utilizes standard BART survival software directly, without needing to modify existing (complex) BART software. Parallel computing was used to speed up the computations. This R wrapper function can easily be adjusted to accommodate other types of Bayesian machine learning methods. This paper is organized as follows. In Section 2, we briefly review related methods and algorithms, and describe the extended AFT-BML approach for optimizing DTRs. Section 3 presents simulation studies that demonstrate our model's performance by comparing it to estimation using Q-learning. An analysis of our motivating dataset of patients diagnosed with AML is given in Section 4. Finally, in Section 5 we discuss the advantages and disadvantages of our approach and provide some suggestions for future work.

Dynamic Treatment Regimes
A dynamic treatment regime (DTR) is a series of decision rules that assign treatment based on the patient's characteristics and history at each stage. Without loss of generality, we focus on a two-stage intervention problem. Furthermore, we start by describing DTRs in the non-survival setting before proceeding to the censored survival setting. Following the notation of Murray et al. (2018), let $o_1 \in O_1$ be the covariates observed before Stage 1, and $a_1 \in A_1$ the action taken at Stage 1. Denote by $y_1$ the payoff observed after Stage 1 and before Stage 2; $\{o_2, a_2, y_2\}$ are defined similarly for Stage 2. The total payoff is assumed to be $y = y_1 + \eta y_2$, where $\eta$ is an indicator that the patient entered Stage 2. Denote the accumulated history before the Stage 2 treatment as $\bar{o}_2 = (o_1, a_1, y_1, o_2) \in \bar{O}_2$. In this setting, a DTR consists of two decision rules, one for each stage: $d_1 : O_1 \to A_1$ and $d_2 : \bar{O}_2 \to A_2$. Optimizing the two-stage DTR $(d_1, d_2)$ is equivalent to finding the decision rules $d_1$ and $d_2$ that maximize the expected total payoff $E(y)$, that is,
$$(d_1^{opt}, d_2^{opt}) = \arg\max_{(d_1, d_2)} E\{y(d_1, d_2)\}.$$
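To make the backward-induction idea behind this objective concrete, here is a toy sketch with made-up payoff values, a deterministic history transition, and every patient assumed to enter Stage 2 (none of these numbers come from the paper): first optimize $d_2$ for each possible history, then optimize $d_1$ against the Stage 2 optimum.

```python
# Toy backward induction for a two-stage DTR with binary actions.
# Stage 2 expected payoffs are indexed by (o2, a2); Stage 1 payoffs by
# (o1, a1); next_o2 gives the (deterministic) history reached after (o1, a1).
E_y2 = {("low", 0): 1.0, ("low", 1): 2.0,    # E[y2 | o2, a2]
        ("high", 0): 3.0, ("high", 1): 1.5}
E_y1 = {("base", 0): 1.0, ("base", 1): 0.5}  # E[y1 | o1, a1]
next_o2 = {("base", 0): "low", ("base", 1): "high"}

# Stage 2 rule: for each history, pick the action maximizing E[y2 | o2, a2].
d2_opt = {o2: max((0, 1), key=lambda a2: E_y2[(o2, a2)])
          for o2 in ("low", "high")}

# Stage 1 rule: maximize E[y1 + y2] assuming d2_opt is followed at Stage 2.
def total(o1, a1):
    o2 = next_o2[(o1, a1)]
    return E_y1[(o1, a1)] + E_y2[(o2, d2_opt[o2])]

d1_opt = {o1: max((0, 1), key=lambda a1: total(o1, a1)) for o1 in ("base",)}
```

Note that acting greedily at Stage 1 alone (maximizing $E[y_1]$) would pick $a_1 = 0$ here, while accounting for the optimal Stage 2 payoff flips the choice, which is exactly why the stages must be optimized jointly via backward induction.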

Bayesian Machine Learning for DTRs
Murray et al. (2018) described a new approach called Bayesian machine learning (BML) to optimize DTRs; the method requires fitting a series of Bayesian regression models in reverse sequential order under the approximate dynamic programming framework. The authors use potential outcomes notation to describe their approach, where $y(a_1, a_2)$ denotes the payoff observed when action $a_1$ is taken at Stage 1 and action $a_2$ is taken at Stage 2, and the other potential outcomes ($y_2(a_1, a_2)$, $y_1(a_1)$, and $o_2(a_1)$) are defined similarly. Assuming the potential outcomes are consistent, the observed outcome corresponds to the potential outcome for the action actually followed, e.g., $y_1(a_1) = y_1$, $o_2(a_1) = o_2$, $y_2(a_1, a_2) = y_2$, and $y(a_1, a_2) = y$. The approach proceeds by backward induction: a Stage 2 regression model is fit first and used to identify the optimal Stage 2 rule $d_2^{opt}$. Note that if the observed action $a_2$ matches the optimal action according to $d_2^{opt}$, then the potential payoff is simply the observed payoff $y = y_1 + \eta y_2$. Otherwise, the potential outcome is unobserved and must be imputed (in the BML method, sampled from the posterior predictive distribution, as described further below). Given imputed values, the Stage 1 regression model for $y(a_1, d_2^{opt})$ can then be estimated with the observed covariates $(o_1, a_1)$ to identify $d_1^{opt}$. This type of backward induction strategy is used in several DTR estimation methods, including g-estimation, Q-learning, and dynamic weighted ordinary least squares (Robins (2004); Moodie et al. (2007); Nahum-Shani et al. (2012); Goldberg and Kosorok (2012); Simoneau et al. (2020)).
Estimation of the terminal-stage regression model is simply a typical model of outcome by predictors, fit using standard Bayesian methods. Estimation of the nonterminal-stage models, on the other hand, is not easily done with standard Bayesian software, because the counterfactual or potential payoff under the unobserved optimal action at each subsequent stage contributes to the outcome at the current stage. To address this problem, Murray et al. (2018) developed a backward induction Gibbs (BIG) sampler to implement the proposed BML approach in practice. It consists of three steps, repeated until convergence, with a hat over random variables indicating sampled values in the MCMC algorithm:

Step 1 Draw a posterior sample of the parameters $\theta_2$ in the Stage 2 model and set the optimal action $\hat{a}^{opt}_{i2} = \hat{d}^{opt}_2(\bar{o}_{i2}; \hat{\theta}_2)$, $i = 1, \ldots, n$.

Step 2 For each subject with $\eta_i = 1$, set $\hat{y}^{opt}_{i2} = y_{i2}$ if $a_{i2} = \hat{a}^{opt}_{i2}$; otherwise draw $\hat{y}^{opt}_{i2}$ from the posterior predictive distribution of the Stage 2 payoff under $(\bar{o}_{i2}, \hat{a}^{opt}_{i2})$.

Step 3 Draw a posterior sample of the parameters $\theta_1$ in the Stage 1 model using the outcome $y_{i1} + \eta_i \hat{y}^{opt}_{i2}$.
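As a minimal illustration of the flow of these three steps, the following sketch substitutes simple conjugate Bayesian linear regressions (known noise variance) for the flexible Bayesian regression models of the BML approach, on simulated data in which every subject enters Stage 2; it is an illustrative sketch of the BIG sampler, not the actual BML implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_draw(X, y, sigma2=1.0, tau2=10.0):
    """Sample regression coefficients from the conjugate normal posterior
    (known noise variance sigma2, N(0, tau2 I) prior on coefficients)."""
    prec = X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2
    cov = np.linalg.inv(prec)
    mean = cov @ X.T @ y / sigma2
    return rng.multivariate_normal(mean, cov)

# Simulated two-stage data (everyone reaches Stage 2 for simplicity).
n = 200
o1 = rng.normal(size=n); a1 = rng.integers(0, 2, n)
o2 = rng.normal(size=n); a2 = rng.integers(0, 2, n)
y1 = 1.0 + 0.5 * a1 + rng.normal(size=n)
y2 = 1.0 + a2 * (1.0 - 2.0 * o2) + rng.normal(size=n)

def X2(a):  # Stage 2 design: intercept, action, action-by-covariate
    return np.column_stack([np.ones(n), a, a * o2])

def X1(a):  # Stage 1 design: intercept, action, action-by-covariate
    return np.column_stack([np.ones(n), a, a * o1])

draws_d1 = []
for _ in range(200):  # BIG sampler iterations
    # Step 1: sample Stage 2 parameters and derive optimal Stage 2 actions.
    th2 = posterior_draw(X2(a2), y2)
    a2_opt = (X2(np.ones(n)) @ th2 > X2(np.zeros(n)) @ th2).astype(int)
    # Step 2: keep the observed payoff if the optimal action was taken;
    # otherwise draw from the posterior predictive under the optimal action.
    y2_opt = np.where(a2 == a2_opt, y2,
                      X2(a2_opt) @ th2 + rng.normal(size=n))
    # Step 3: sample Stage 1 parameters from the model for y1 + y2_opt.
    th1 = posterior_draw(X1(a1), y1 + y2_opt)
    draws_d1.append((X1(np.ones(n)) @ th1 > X1(np.zeros(n)) @ th1).astype(int))

# Posterior frequency of action 1 being optimal at Stage 1, per subject.
d1_opt = (np.mean(draws_d1, axis=0) > 0.5).astype(int)
```

The key structural point is that the Stage 1 outcome changes with each iteration, because the imputed optimal Stage 2 payoff is re-drawn from the current Stage 2 posterior sample.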

AFT BART
Bayesian additive regression trees (BART) form a Bayesian nonparametric ensemble-of-trees regression model developed by Chipman et al. (2010). Accelerated failure time BART (Bonato et al. (2011)) is an extension of the approach that accommodates censored outcomes, assuming the event time follows a log-normal distribution. Let $t_i$ be the event time and $c_i$ the censoring time for individual $i$. Then the observed survival time is $s_i = \min(t_i, c_i)$, and the event indicator is $\delta_i = I(t_i < c_i)$. Denote by $x_i = (x_{i1}, \ldots, x_{ip})$ the $p$-dimensional vector of predictors. The relationship between $t_i$ and $x_i$ is expressed as
$$\log t_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),$$
where $f(x_i)$ is a sum of $m$ regression trees, $f(x_i) \equiv \sum_{j=1}^{m} g(x_i; T_j, M_j)$, with $T_j$ denoting a binary tree with a set of internal and terminal nodes, and $M_j = \{\mu_{j1}, \ldots, \mu_{jb_j}\}$ denoting the set of parameter values on the terminal nodes of tree $T_j$. Full details of the BART model, including prior distributions and the MCMC sampling algorithm, can be found in Chipman et al. (2010). Since $t_i$ is not observed for censored observations, an extra data augmentation step to impute $t_i$ is needed in each iteration when drawing Markov chain Monte Carlo (MCMC) posterior samples with Gibbs sampling. In particular, the unobserved log event times are sampled from a truncated normal distribution,
$$\log t_i \mid s_i, \delta_i = 0 \;\sim\; N\big(f(x_i), \sigma^2\big) \text{ truncated to } (\log s_i, \infty).$$
After data augmentation, the complete log event times are treated as continuous outcomes, and the standard BART MCMC draws can be applied.
The AFT BART model with a log-normal survival distribution is implemented in the BART R package (Sparapani et al. (2021)); additional details are given in Appendix B.
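The data augmentation step can be sketched as follows, with `f_x` standing in for the current BART fit and a simple rejection sampler used for the truncated normal draw; this is an illustrative sketch, not the BART package's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_censored_log_times(f_x, sigma, log_s, delta):
    """Data-augmentation draw: for censored subjects (delta == 0), sample
    log t from N(f_x, sigma^2) truncated to (log_s, infinity) by rejection
    sampling; events keep their observed log time."""
    log_t = log_s.copy()
    for i in np.flatnonzero(delta == 0):
        draw = rng.normal(f_x[i], sigma)
        while draw <= log_s[i]:          # reject draws below the censoring time
            draw = rng.normal(f_x[i], sigma)
        log_t[i] = draw
    return log_t

# Example with simulated fitted values, censoring, and event indicators.
n = 100
f_x = rng.normal(2.0, 0.5, n)            # stand-in for the current BART fit
sigma = 0.7
true_log_t = f_x + rng.normal(0, sigma, n)
log_c = rng.normal(2.0, 0.7, n)          # log censoring times
delta = (true_log_t < log_c).astype(int)
log_s = np.minimum(true_log_t, log_c)    # observed log survival times

aug = impute_censored_log_times(f_x, sigma, log_s, delta)
```

After this step, the augmented log times `aug` can be treated as fully observed continuous outcomes for the next round of tree updates.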

Proposed AFT-BML algorithm
Since the BML approach of Murray et al. (2018) is not directly applicable to censored observations, we extended it by modifying the BIG sampler so that censoring can be accommodated. Here we are interested in the time to an event (such as death) from the start of Stage 1. The Stage 2 treatment decision initiates at an intermediate event such as disease progression. This effectively separates the payoff, or event time, into two components: the time to the earlier of the event of interest and the intermediate event triggering Stage 2 ($t_1$), and, if the patient enters Stage 2 ($\eta = 1$), the time from the start of Stage 2 to the event of interest ($t_2$). Observed data accounting for censoring and entry to Stage 2 are denoted $(s_1, \delta_1)$ for Stage 1 and $(s_2, \delta_2)$ for Stage 2. Continuing with the potential outcomes notation, let $t(a_1, a_2)$ denote the time to the event of interest when action $a_1$ is taken at Stage 1 and action $a_2$ at Stage 2. Similarly, let $t_2(a_1, a_2)$ denote the event time in Stage 2 (starting at entry to Stage 2) under actions $(a_1, a_2)$. Finally, the potential time $t_1(a_1)$ is the time in Stage 1 until the first of the event of interest or entry to Stage 2. Corresponding payoffs on the log time scale are denoted $y(a_1, a_2) = \log t(a_1, a_2)$, $y_2(a_1, a_2) = \log t_2(a_1, a_2)$, and $y_1(a_1) = \log t_1(a_1)$. Under consistency, the observed outcome corresponds to the potential outcome for the action actually followed, e.g., $t_1(a_1) = t_1$, $t_2(a_1, a_2) = t_2$, and $t(a_1, a_2) = t$, and similarly for the $y = \log t$ versions.
Murray et al. (2018) recommended using Bayesian nonparametric regression models in Stages 1 and 2 for robustness. Here we illustrate our approach with AFT BART models at each stage. As before, we use a hat over random variables to indicate sampled values in the MCMC algorithm. The Stage 2 regression model for $t_2(a_1, a_2)$ is estimated first, using the observed covariates $(\bar{o}_2, a_2)$ and the observed time-to-event data $(s_2, \delta_2)$, according to the AFT BART model
$$\log t_{i2} = f_2(\bar{o}_{i2}, a_{i2}) + \epsilon_{i2}, \quad \epsilon_{i2} \sim N(0, \sigma_2^2).$$
We run the Stage 2 BART model until convergence, draw 1000 posterior samples from the model, and then sample the optimal Stage 2 treatment rule for each MCMC sample according to
$$\hat{d}^{opt}_2(\bar{o}_2) = \arg\max_{a_2} \hat{f}_2(\bar{o}_2, a_2).$$
We then implement a sampling procedure to generate potential outcomes for the total time from Stage 1 assuming optimal Stage 2 treatment, $\hat{t}(a_1, \hat{d}^{opt}_2) = t_1 + \hat{t}_2(a_1, \hat{d}^{opt}_2)$. Some of the potential outcomes resulting from this procedure may still be censored, and we denote the possibly censored version of these potential outcomes as $(\hat{s}, \hat{\delta})$. These event time data are then modeled as a function of the covariates $(o_1, a_1)$ using another AFT BART model,
$$\log \hat{t}_i = f_1(o_{i1}, a_{i1}) + \epsilon_{i1}, \quad \epsilon_{i1} \sim N(0, \sigma_1^2).$$
For each sampled potential-outcomes dataset, we run the Stage 1 AFT BART model until convergence and then draw one posterior sample from each fitted BART model to determine a sample from the posterior of $d_1^{opt}$. Details of the AFT-BML algorithm are as follows:
Step 1 Run BART on the Stage 2 data until convergence and draw 1000 samples from the posterior distribution of $f_2$ and $\sigma_2^2$. This implicitly involves the following two steps, which are performed automatically by the BART package: Step 1a Draw the unobserved event times for censored subjects ($\delta_{i2} = 0$) at Stage 2. Step 1b Update $f_2$ and $\sigma_2^2$ with the complete (uncensored) Stage 2 data.
Step 2 Draw 1000 samples of $(\hat{a}^{opt}_{i2}, \hat{t}^{opt}_{i2})$ for each subject and use each sample to create an augmented dataset for the Stage 1 analysis, as follows. Step 2a The optimal action at Stage 2 is chosen as $\hat{a}^{opt}_{i2} = \arg\max_{a_2} \hat{f}_2(\bar{o}_{i2}, a_2)$.
Step 2b If $a_{i2} = \hat{a}^{opt}_{i2}$ and the observation is an event ($\delta_{i2} = 1$), set $\hat{t}^{opt}_{i2} = t_{i2}$. Step 2c Otherwise, draw $\hat{t}^{opt}_{i2}$ from the posterior predictive distribution of the Stage 2 event time under $(\bar{o}_{i2}, \hat{a}^{opt}_{i2})$; when $a_{i2} = \hat{a}^{opt}_{i2}$ but the observation is censored, this draw is taken from the corresponding distribution truncated below at $s_{i2}$. Step 2d For those who reached Stage 2 ($\eta_i = 1$), set the observed data for the Stage 1 model as the potential Stage 1 event time $\hat{t}_i$, i.e., $\hat{s}_i = \hat{t}_i = t_{i1} + \hat{t}^{opt}_{i2}$, and set $\hat{\delta}_i = 1$. For those who did not reach Stage 2, set the observed data for the Stage 1 model as $\hat{s}_i = t_{i1}$ and $\hat{\delta}_i = \delta_{i1}$.
Step 3 Run BART on each of the augmented Stage 1 datasets until convergence and draw one sample from the posterior distribution of $f_1$ and $\sigma_1^2$ for each augmented dataset. As above, this involves the following two steps, which are performed automatically by the BART package: Step 3a Draw the unobserved event time $\hat{t}_i$ for censored subjects ($\hat{\delta}_i = 0$) at Stage 1. Step 3b Update $f_1$ and $\sigma_1^2$ with the complete (uncensored) Stage 1 data.
Step 4 From each augmented dataset and the corresponding sampled $f_1$ and $\sigma_1^2$, draw one sample $\hat{a}^{opt}_{i1}$ for each subject, based on $\hat{a}^{opt}_{i1} = \arg\max_{a_1} \hat{f}_1(o_{i1}, a_1)$.

The original BIG sampler indicated that the sampling of the Stage 1 parameters should be updated using the values from the prior iteration. While that could potentially speed up the implementation, since it may not require a full burn-in for each new Stage 1 dataset, it is challenging to implement because most BART software does not allow an update step to start from a specified value of the tree structures and terminal-node means. Instead, we leverage the fact that the BART chain for Stage 2 does not depend on any updates of the Stage 1 model parameters. Because of this, the BART model for Stage 2 can be run independently and used to generate the potential datasets for Stage 1. Once the 1000 datasets for Stage 1 have been sampled, the BART analyses of each of these Stage 1 datasets in Step 3 can be done in parallel using off-the-shelf BART software.
Our approach for drawing event times for patients who were censored and who received optimal treatment in Stage 2 was to first sample exact event times for Stage 2 data from the Stage 2 model and then pass this value as an event to the Stage 1 dataset (after adding the observed time in the first stage).Alternatively, one could pass the value as censored to the Stage 1 dataset, in which case the AFT BART model would implicitly sample event times using the Stage 1 model, instead of using the Stage 2 model as in the algorithm above.We also implemented and examined this alternative approach in our simulation studies, but found no measurable difference in the results, so we did not consider it further.
We implemented the proposed method by creating a wrapper function called dtr1 that utilizes the BART R package Sparapani et al. (2021).Details on the software implementation can be found in the Appendix.

Simulation Design
We conducted simulation studies with 200 replicated training sets of sample size N = 800 and an independent testing set of sample size n = 400 for each scenario of interest, to assess the predictive performance of our method. We used an observational study design with two stages of treatment and two candidate treatments at each stage. The treatment assignments were generated from Bernoulli distributions with probabilities P(A1 = 1) and P(A2 = 1), respectively. Both the event time at Stage 2 and the overall event time assuming optimal treatment at Stage 2 were generated from AFT models assuming a log-normal distribution, similar to the approach of Simoneau et al. (2020). The censoring time was assumed to follow a uniform distribution chosen to yield approximately 20% to 30% independent censoring.
We fit each training dataset with our method, and made predictions of the optimal action and of the mean of the log-normal event time distribution under optimal treatment at each stage on the test dataset. Our performance was compared against Q-learning, including an oracle model along with other models that misspecified the relationship for either stage. We examined the proportion of optimal treatment (POT), the mean squared error (MSE), and the coverage rate (CR) of 95% credible intervals (for the BART approach only). The POT at stage $k$ is defined as
$$POT_k = \frac{1}{n} \sum_{i=1}^{n} I(\hat{a}^{opt}_{ik} = a^{opt}_{ik}), \quad k = 1, 2. \quad (1)$$

Simulation Settings
For subject $i$, a continuous baseline covariate $x_{i1}$ was drawn from a U(0.1, 1.29) distribution, and a binary baseline covariate $b_{i1}$ from a Bernoulli distribution with probability 0.5. Similarly, a continuous covariate $x_{i2}$ measured at the beginning of Stage 2 was generated from a U(0.9, 2) distribution, and a binary covariate $b_{i2}$ measured at the beginning of Stage 2 was drawn from a Bern(0.5) distribution. Additionally, there were two noise covariates, $z_{i1} \sim N(10, 3^2)$ and $z_{i2} \sim N(20, 4^2)$, collected at the beginning of Stage 1 and Stage 2, respectively. When fitting the data, all the stage-wise covariates were included in the models to mimic real-world settings in which there is uncertainty as to which covariates are relevant predictors of the outcomes. The Stage 1 treatment was assigned from a Bernoulli distribution with probability of receiving treatment $P(a_{i1} = 1) = \mathrm{expit}(2x_{i1} - 1)$, where $\mathrm{expit}(x) = \exp(x)/(1 + \exp(x))$ is the inverse of the logit function. For those who entered the second stage ($\eta_i = 1$), the Stage 2 treatment was sampled from a Bernoulli distribution with $P(a_{i2} = 1) = \mathrm{expit}(-2x_{i2} + 2.8)$. The probability of entering Stage 2 was fixed at 0.6, i.e., $P(\eta_i = 1) = 0.6$.
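The covariate and treatment-assignment mechanism above can be sketched as follows (the scenario-specific outcome models of equations (2)-(5) are omitted):

```python
import numpy as np

rng = np.random.default_rng(42)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))  # inverse logit

def simulate_covariates_and_treatments(n):
    """Generate the baseline and Stage 2 covariates, noise covariates,
    and treatment assignments described in the simulation settings."""
    x1 = rng.uniform(0.1, 1.29, n)           # continuous baseline covariate
    b1 = rng.binomial(1, 0.5, n)             # binary baseline covariate
    x2 = rng.uniform(0.9, 2.0, n)            # continuous Stage 2 covariate
    b2 = rng.binomial(1, 0.5, n)             # binary Stage 2 covariate
    z1 = rng.normal(10, 3, n)                # noise covariate, Stage 1
    z2 = rng.normal(20, 4, n)                # noise covariate, Stage 2
    a1 = rng.binomial(1, expit(2 * x1 - 1))  # P(a1 = 1) = expit(2 x1 - 1)
    eta = rng.binomial(1, 0.6, n)            # 60% of subjects enter Stage 2
    a2 = np.where(eta == 1,
                  rng.binomial(1, expit(-2 * x2 + 2.8)), 0)
    return dict(x1=x1, b1=b1, x2=x2, b2=b2, z1=z1, z2=z2,
                a1=a1, a2=a2, eta=eta)

dat = simulate_covariates_and_treatments(800)  # one training set of N = 800
```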
We considered two different scenarios for the relationship between the log event time and the covariates. In Scenario 1, we used an AFT model, given in equation (2), to generate the event time at Stage 2. The true optimal treatment, $a^{opt}_{i2} = I(-0.7 + 0.5x_{i2} - 0.9b_{i2} > 0)$, was plugged into equation (2) as a new $a_{i2}$ to calculate the optimal Stage 2 event time $t^{opt}_{i2}$ had everyone received their optimal treatment at Stage 2. The overall event time $t_i$ assuming optimal Stage 2 treatment was again generated from an AFT model, given in equation (3). For those who did not enter Stage 2, $t_i$ was their event time. For those who entered Stage 2, the observed Stage 1 survival time was $t_{i1} = t_i - t^{opt}_{i2}$, and the Stage 2 event time was $t_{i2}$. The censoring time $c_i$ was generated from U(100, 2000) to yield an overall censoring rate of around 20%.
As a comparator, we also fitted the data with parametric Q-learning models. Since there were two stages in our simulated data, we chose either a correctly specified (T) or a misspecified (F) Q-function model for each stage. Combining the two stages yields four possible modelling specifications: $Q_{1T2T}$, $Q_{1T2F}$, $Q_{1F2T}$, and $Q_{1F2F}$. Among these four Q-learning models, $Q_{1T2T}$ correctly specifies the parametric form in both stages; we refer to this as the oracle model.
In Scenario 2, we followed a similar structure to simulate the data, but with a different set of true models that include non-linear transformations of the covariates. The event time at Stage 2 was generated from the AFT model given in equation (4). The true optimal treatment, $a^{opt}_{i2} = I(0.7x_{i2}^2 - 1 > 0)$, was used to replace $a_{i2}$ in equation (4) to calculate the optimal Stage 2 event time $t^{opt}_{i2}$. The overall event time $t_i$ assuming optimal Stage 2 treatment was generated from a non-linear AFT model (equation (5)) involving a sine transformation of the covariates. The Stage 1 and Stage 2 survival times were calculated in the same way as in Scenario 1, depending on whether the observation entered the second stage. The censoring time $c_i$ was now generated from U(400, 5000) to achieve an overall censoring rate of around 30%.
Based on the underlying true nonlinear functions of the covariates, we constructed two misspecified Q-learning models in addition to the oracle model. The first misspecified model, $Q_{lin}$, considered only linear terms in the covariates for both stages. The second misspecified model, $Q_{int}$, added to $Q_{lin}$ all two-way interactions among the covariates and all interactions between treatment and covariates in each stage, so that these models were not correctly specified but were nonetheless richer and more flexible than their purely linear counterparts.

Simulation Results
For the proposed method, denoted as BART in the figures, we created a wrapper function dtr1 that implemented the algorithm described in Section 2.4; further documentation of this implementation is available in the Appendix.
For Q-learning, we first used the survreg function from the R package survival (Terry M. Therneau and Patricia M. Grambsch (2000); Therneau (2022)) to fit the Stage 2 model, then made predictions of the optimal second-stage treatment and the corresponding optimal survival time to create the Stage 1 data. The survreg function was called again to fit the augmented Stage 1 data and estimate the optimal first-stage treatment with the corresponding optimal overall survival time. The general framework is the same as in the proposed method. Given that all the covariates at both stages were simulated for every subject, the related Stage 2 treatment and time were predicted for everyone in the test set. This is unrealistic in practice, since some covariates are not available if a patient never entered Stage 2.
The proportion of optimal treatment at each stage is defined as the proportion of subjects in the test set (here, 400) whose true optimal treatment is correctly identified by the model at that stage, as shown in equation (1). For the overall or combined POT, only those who have the true optimal treatment correctly identified at both stages are counted in the numerator:
$$POT_{overall} = \frac{1}{n} \sum_{i=1}^{n} I(\hat{a}^{opt}_{i1} = a^{opt}_{i1}) \, I(\hat{a}^{opt}_{i2} = a^{opt}_{i2}).$$
It is straightforward to calculate POTs with Q-learning, since that approach makes only one prediction of the optimal treatment at each stage for each observation. With our Bayesian approach, a little additional effort is needed. Since BART draws posterior samples to describe the posterior distribution, the raw results from our approach are also a set of posterior samples. Specifically, every posterior sample consists of the outcomes under each candidate treatment at each stage for each observation. Comparing the corresponding outcomes across the candidate treatments, a collection of optimal treatments and optimal outcomes at each stage can be obtained for each observation and each posterior sample. With a set of posterior samples of the optimal treatment, we can compute the posterior mean probability that each candidate treatment is optimal, and the treatment with the highest posterior mean is the final prediction of the optimal treatment at each stage for each observation. That is the quantity we used to calculate the POTs for AFT-BML. The mean squared error (MSE) was calculated by comparing the estimated optimal Stage 2 and overall log event time means to the true means.

Figure 1 shows the decomposition of the MSE at both stages, as well as the stage-wise POT and overall POT, for the proposed method, the oracle Q-learning model, and the three misspecified models in Scenario 1. Notice that in both equations (2) and (3), the relationship between log event time and covariates is linear. As a parametric method with the same structure as the underlying true model, the oracle model bests the others, with a very small MSE with zero bias, close to 100% stage-wise POT at both stages, and a 92% overall POT. For Stage 2, $Q_{1F2T}$ has the same MSE and stage-wise POT as the oracle model, since they specify the Stage 2 functional form in exactly the same way. Among the other three models, our method performs best, with a smaller MSE, an even smaller bias, and a stage-wise POT that is higher by more than 20%. For Stage 1, the MSE from $Q_{1T2F}$ is slightly larger than, but very similar to, the oracle's, and its stage-wise POT is almost the same as the oracle model's, even though the predicted Stage 2 optimal survival time from $Q_{1T2F}$ was based on a misspecified Stage 2 model. This is mainly because the simulated Stage 2 event times were relatively small compared to the overall event times, so an incorrect prediction for Stage 2 had a minimal impact on the augmented overall survival time. This resulted in a very similar dataset between $Q_{1T2F}$ and the oracle when fitting the Stage 1 model. The proposed Bayesian method, as in Stage 2, has the smallest MSE and the highest stage-wise POT compared to the other two models ($Q_{1F2F}$ and $Q_{1F2T}$). Taking the optimal treatment at both stages into consideration via the overall POT, our proposed method is better than all approaches except the oracle.
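The posterior-sample-based POT computation described above can be sketched as follows, using hypothetical posterior draws in place of actual BART output:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior draws of the predicted log event time under each
# candidate treatment: shape (n_draws, n_subjects, n_treatments).  Half the
# subjects clearly do better on treatment 1, the other half on treatment 0.
n_draws, n_sub = 1000, 50
post = rng.normal(size=(n_draws, n_sub, 2))
post[:, : n_sub // 2, 1] += 1.0
post[:, n_sub // 2 :, 0] += 1.0

# For each draw and subject, the optimal treatment maximizes the outcome.
opt_draws = post.argmax(axis=2)                       # (n_draws, n_subjects)

# Posterior mean probability that each treatment is optimal, per subject;
# the treatment with the highest posterior mean is the final prediction.
p_opt = np.stack([(opt_draws == k).mean(axis=0) for k in (0, 1)], axis=1)
pred_opt = p_opt.argmax(axis=1)

# POT: agreement between the predicted and true optimal treatments.
true_opt = np.array([1] * (n_sub // 2) + [0] * (n_sub // 2))
pot = np.mean(pred_opt == true_opt)
```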
The results from Scenario 2 are shown in figure 2. The relationship between the covariates and log event time is nonlinear in both equations (4) and (5). As expected, the oracle model has close to zero MSE and close to 100% POTs. Our method outperforms the other two models ($Q_{lin}$ and $Q_{int}$) with a much smaller MSE and higher POTs. The magnitude of the differences in figure 2 is larger than in figure 1. The advantage of our nonparametric method becomes more obvious in capturing the nonlinear dependencies, while the two parametric Q-learning models suffered from incorrect model structures.
Another quantity that is not easily estimable with Q-learning, but comes at no extra cost with BART, is the measure of uncertainty in the estimated optimal values. By drawing MCMC posterior samples, the standard error of the log optimal survival times can be calculated as the sample standard deviation, and credible intervals for the log event time can be derived from the collection of posterior samples. In contrast, to obtain standard errors and confidence intervals with Q-learning, bootstrap sampling must be carried out. Here, we show only the coverage rate (CR) of the 95% credible intervals for our method, in figure 3; the Q-learning models are not presented because of the poor fit and high biases seen in figures 1 and 2. The boxplots of the 95% CR for both scenarios are almost always above the nominal 95% at Stage 1, and the lower quartiles are above the nominal 95% at Stage 2. This indicates that the proposed method has good accuracy in estimating the uncertainty in the log event time mean under optimal treatment at both stages, although the CRs are slightly conservative (above 95%).

Motivating Analysis: Optimal Treatment for AML Patients undergoing Transplant
In this section, we apply the proposed method to data from the Center for International Blood and Marrow Transplant Research (CIBMTR) (Krakow et al. (2017)). The data contain 4171 patients with complete information who received graft-versus-host disease prophylaxis for an allogeneic hematopoietic cell transplant, used to treat their myeloid leukemia, between 1995 and 2007. Some patients were subsequently given a salvage treatment after they developed GVHD and initial treatment was unsuccessful. The two stages considered in this study were the up-front GVHD prophylaxis treatment and the salvage treatment after developing GVHD and failing initial treatment (which is consistently given as steroids). In each stage, patients were assigned one of two treatments: nonspecific highly T-cell lymphodepleting (NHTL) immunosuppressant therapy or a standard prophylaxis immunosuppressant. Our primary goal is to estimate an optimal DTR that maximizes the overall disease-free survival time for patients; the primary outcome is thus the disease-free survival (DFS) time, i.e., the time to an event such as death or disease persistence.

Both the Q-learning and AFT-BML approaches were used to fit this two-stage survival DTR estimation problem. All main effects and the two-way interactions between the stage-wise treatment and the other covariates were included in the Q-learning models, and 1000 nonparametric bootstrap resamples were generated to estimate the uncertainty of the quantities of interest. For the nonparametric AFT-BML, with 1000 MCMC posterior samples, the full posterior distribution is available for any prediction. The point estimates of the parameters, along with the bootstrap means and 95% confidence intervals (CIs) at each stage, were examined for Q-learning. Waterfall plots of the mean differences in DFS on the log time scale under each treatment at each stage were created for each individual for both Q-learning and AFT-BML, with the 95% and 50% credible intervals (bootstrap CIs for Q-learning) presented on the same plots. The differences in median DFS were also explored for both methods, as were the differences in the two-year DFS probabilities.
The analysis results from Q-learning, including point estimates, bootstrap means, and bootstrap 95% CIs, are shown in appendix tables 3 and 4. Inspection of the 95% CIs for the interaction terms in table 3 reveals covariate combinations that identify subgroups in which the NHTL or the standard treatment is preferable. For example, with all other covariates at their reference levels, a patient with an unrelated donor would benefit more from NHTL than from the standard treatment at Stage 1. Similarly, in table 4, holding the other covariates at their reference levels, a patient who received NHTL at Stage 1 would be expected to have a longer DFS time if the standard treatment were given at Stage 2, since the 95% CI for the A2.NHTL*A1.NHTL interaction lies below zero.
Intuitively, this might be expected because a salvage treatment different from the initial treatment, which has already failed, might be more effective.
In figure 4, we present the estimated treatment differences on the log time scale for each stage, along with 95% CI and 50% CI.For AFT-BML, a direct posterior prediction difference for each individual can be calculated from 1000 MCMC posterior samples based on patient-level characteristics.The credible intervals are constructed using the quantiles of posterior samples of each patient.For Q-learning, a standard error can be estimated from the predictions of 1000 bootstrap resamples at an individual level.Using the estimated standard error, the CIs for each patient are calculated in a standard way.A positive difference means NHTL is the preferred treatment for a given stage.The patients are presented in descending order based on the estimated difference, separately for each method.
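The two interval constructions can be sketched as follows, with `draws` standing in for either a patient's posterior samples (AFT-BML) or bootstrap predictions (Q-learning) of the treatment difference; the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical draws of one patient's treatment difference on the log
# time scale (simulated here in place of real posterior/bootstrap output).
draws = rng.normal(0.4, 0.2, size=1000)

# AFT-BML: credible intervals taken directly from posterior quantiles.
ci95_bayes = np.quantile(draws, [0.025, 0.975])
ci50_bayes = np.quantile(draws, [0.25, 0.75])

# Q-learning: normal-approximation CI from the bootstrap standard error.
se = draws.std(ddof=1)
est = draws.mean()
ci95_boot = (est - 1.96 * se, est + 1.96 * se)
```

A patient would be flagged as clearly benefiting from NHTL when the whole interval lies above zero, which is how the subgroups in the waterfall plots are read off.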
The results from AFT-BML (figures 4a and 4c) suggest that there is little to be gained by individualizing the treatment in Stage 2, since everyone benefits from the standard treatment, but that at Stage 1 there may be significant clinical value in choosing treatment in a personalized fashion to maximize the overall DFS time. In fact, there appear to be four subgroups with distinct differences in expected log event time: two groups would benefit from NHTL, one group is indifferent to the treatment choice, and one final group would have a longer survival time with the standard treatment. The right panels, figures 4b and 4d, provide a similar message using Q-learning: while the standard treatment is preferred for most (though not all) patients at Stage 2, individualizing the treatment at Stage 1 may lead to important benefits in log DFS time. However, Stage 1 shows a more continuous spectrum of treatment differences under Q-learning than under AFT-BML.
In figure 5, we present the predicted treatment differences on a different time scale, specifically the estimated differences in median DFS time with 95% and 50% CIs at each stage. Since the event time is assumed to follow a log normal distribution, exponentiating the mean on the log scale, which transforms the predictions back to the original time scale, yields the median rather than the mean. The results on this scale are generally consistent with the log time scale results, though the scale change produces some unusual features. Some patients in the middle of figure 5a have wider CIs than their neighbors. These are patients whose survival predictions are higher (under both treatments) and whose corresponding prediction variances are also higher, leading to wider CIs. Similar behavior can be seen at both ends of the same figure, where patients who are predicted to live longer on either treatment also have wider CIs.
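The median-versus-mean distinction can be checked numerically (illustrative Python with made-up parameter values; μ and σ here are hypothetical, not estimates from the CIBMTR data): for T ~ LogNormal(μ, σ), the median is exp(μ) while the mean is exp(μ + σ²/2).

```python
import numpy as np
from scipy import stats

mu, sigma = 3.0, 0.8  # hypothetical posterior mean and scale of log DFS time

# Back-transforming the log-scale mean gives the median of T, not its mean.
median_T = np.exp(mu)
mean_T = np.exp(mu + sigma**2 / 2)  # the mean is strictly larger when sigma > 0

# Cross-check against scipy's parameterization (s = sigma, scale = exp(mu)).
scipy_median = stats.lognorm.median(s=sigma, scale=np.exp(mu))
```

This is why the waterfall plots on the original time scale in figure 5 are described as differences in median, not mean, DFS time.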
DTRs could also be defined as the optimal treatment rules to maximize each patient's 2-year DFS probability.
Figure 6 shows the 2-year DFS probability difference between NHTL and the standard treatment for all patients, sorted in descending order. For Stage 2, all patients are expected to have a higher 2-year survival probability if assigned the standard treatment based on the AFT-BML method. The Q-learning method agrees with the inferences from AFT-BML except for a very small proportion of patients, although the magnitudes of the differences are much bigger for Q-learning. For Stage 1, AFT-BML clearly splits the patients into four subgroups, similar to what we saw in figure 4a. With Q-learning, however, it is difficult to recognize any clear cut points on the curve. As in Stage 2, the magnitudes of the differences are smaller for AFT-BML. The proportions of patients for whom NHTL is the optimal Stage 1 treatment are consistent between AFT-BML and Q-learning.
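Under the log normal AFT model, a fixed-time survival probability follows directly from the normal CDF on the log scale: S(t) = P(T > t) = 1 − Φ((log t − μ)/σ). A minimal Python sketch of the 2-year difference (the μ values and σ below are hypothetical placeholders, not fitted quantities):

```python
import numpy as np
from scipy.stats import norm

def lognormal_surv(t, mu, sigma):
    """S(t) = P(T > t) for a log normal AFT model: 1 - Phi((log t - mu)/sigma)."""
    return norm.sf((np.log(t) - mu) / sigma)

# Hypothetical posterior means of log DFS time (months) under each treatment.
mu_nhtl, mu_std, sigma = 3.4, 3.1, 0.9
t = 24.0  # two years, in months

# Per-patient 2-year DFS probability difference, NHTL minus standard;
# a positive value favors NHTL, matching the convention in figure 6.
diff_2yr = lognormal_surv(t, mu_nhtl, sigma) - lognormal_surv(t, mu_std, sigma)
```

In the full analysis this difference would be computed per posterior draw and per patient, yielding credible intervals exactly as for the log-scale differences.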
To further compare the survival probability predictive performance of AFT-BML and Q-learning, we calculated the time dependent area under the ROC curve (Heagerty et al. (2000)) with the R package timeROC (Blanche et al. (2013)), using the predicted survival times estimated by both AFT-BML and Q-learning as predictors. For the Stage 2 model, the observed time and event indicator can be used directly in calculating the time dependent AUC. For the Stage 1 prediction model (which assumes patients receive the optimal treatment in Stage 2), we need to account for the fact that not all patients received their optimal treatment in Stage 2. To handle this, depending on whether the estimated optimal treatment at Stage 2 was observed or not, the original observation was either kept as is or censored at the time of entering Stage 2. Since the estimated optimal Stage 2 treatment could differ between AFT-BML and Q-learning, we examined three sets of censored data, with the optimal Stage 2 treatment determined by AFT-BML, by Q-learning, or by both.

To visualize the AFT-BML based DTRs, we applied the 'fit-the-fit' method and plotted a single tree as in Logan et al. (2019). Since there is little value in differentiating the treatment at Stage 2, we focus only on the Stage 1 model here. The outcome used for the single tree fit is the posterior mean treatment difference in log survival time, although other outcomes, such as median DFS or DFS probabilities at fixed time points, could also be used. The R² goodness of fit measure for using the single tree in figure 7 to model the Stage 1 posterior mean differences in log DFS time predictions is above 90%, indicating that this (highly interpretable) single tree is a reasonable representation of the original AFT-BML model at Stage 1. Values in the nodes are the posterior mean differences in mean log DFS time in months between NHTL and the standard treatment.
The corresponding 95% CIs are also shown in the same nodes. The first split (on donor type) indicates that almost all patients receiving unrelated donor transplants would benefit from receiving NHTL as GVHD prophylaxis for their AHCT, while relatively few patients receiving related donor transplants should receive NHTL. As the tree grows, the patients are divided into four subgroups. After the first split, the two bottom nodes on the left are well separated from each other, as are the two nodes on the right side. These observations agree with figure 4a, which identified four subgroups with distinct posterior mean differences in log DFS times.
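The 'fit-the-fit' idea can be sketched compactly: fit a small regression tree to the per-patient posterior mean treatment differences and report its R² as a fidelity measure. The sketch below uses simulated data in Python (the paper's implementation is in R; the covariates, effect sizes, and the role of column 0 as "donor type" are all hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
# Hypothetical binary patient covariates; column 0 plays the role of donor type.
X = rng.integers(0, 2, size=(500, 4)).astype(float)
# Hypothetical posterior mean differences in log DFS time, driven by two covariates
# plus a little noise (standing in for BART posterior summaries).
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=500)

# Shallow tree => interpretable subgroups (here at most four leaves).
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
r2 = tree.score(X, y)  # R^2 of the single-tree summary of the flexible fit
```

A high R² (above 90% in the paper's Stage 1 model) justifies reading the simple tree as a faithful, interpretable summary of the underlying AFT-BML model.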

Conclusion
In medical practice, it is of great clinical value to select the optimal treatment based on the patient's characteristics. In many settings, a sequence of optimal treatments is desired, such as for diseases like cancer that can progress. The AFT-BML approach for identifying DTRs can assist physicians in making sound, data-supported decisions at each stage. Classical parametric approaches, such as Q-learning, are constrained by the necessity of correctly specifying functional forms, including all the interaction terms among covariates and treatment. The Bayesian machine learning approach, in contrast, avoids this restriction and allows the model to adaptively determine the relationships between the outcome and the covariates, facilitating optimal treatment identification. We have extended the Bayesian machine learning approach to censored survival data in an AFT model framework, and we also provide parallelizable code implementing the computationally intensive algorithm. This wrapper function utilizes standard available BART software without needing to modify the complex underlying BART code. With the simulation studies, we have shown that the AFT-BML approach can achieve almost the same performance as the oracle model for censored outcomes. The results from AFT-BML not only include the optimal treatments and optimal outcomes that classical parametric approaches can provide, but also directly offer uncertainty measures for these targets of inference. This extra uncertainty information could be useful in practice, since physicians could better assess the confidence they should place in the recommended optimal treatments.
Our comparison to Q-learning in the simulations and the example used a fixed model specification without variable selection. With a larger number of covariates, a penalized AFT model could be used for variable selection in the Q-learning approach. Furthermore, we utilized bootstrapping to estimate the uncertainty of the Q-learning predictions. Alternative approaches, such as penalized Q-learning (Song et al. (2014)), could be used to directly provide the uncertainty measurements. However, we are not aware that this approach has been extended to censored data models, such as the AFT model used here.
We compared model performance between AFT-BML and Q-learning in the example by censoring the patients who did not receive the optimal subsequent treatment. This was feasible in this data set because most of the patients received the optimal treatment in Stage 2. However, a general strategy for assessing model performance in the dynamic treatment regime setting would be worth further investigation.
There are some limitations to our approach. We have demonstrated the BML approach for censored data using a parametric log normal AFT model, which carries substantial parametric model assumptions even though the functional form of the covariate effects is flexible; other types of Bayesian survival models (Sparapani et al. (2016, 2021)) that require fewer assumptions could alternatively be used. In such cases, the methodology described here and our wrapper function can serve as a template for implementing alternative models in a BML DTR framework.
Another limitation is that our AFT-BML approach and wrapper function are currently implemented for the two-stage AFT-BART survival model. Modifying our current approach to handle more than two stages would introduce some computational challenges. The computational time would increase, since more layers of chain burn-in are required when more than two stages are present. This limitation could be reduced if the multi-stage model fittings were embedded together, though this would lose the flexibility of applying BART functions off the shelf.
Essentially, one would need to output the tree structure and terminal node means at the end of one update, and then pass these to the next imputed dataset being analyzed as the initial tree structure and terminal node means to be updated.

B Software
The R wrapper function that we created for this article can be found at https://github.com/xiaoli-mcw/dtrBART. Simply download and save the wrapper.R file. Note that this function utilizes some functions from the BART3 R package, which is not yet available on CRAN; the BART3 package can be found at https://github.com/rsparapa/bnptools. The function dtr1 can be called conveniently after sourcing wrapper.R with source("myfolder/wrapper.R") (replace myfolder with the actual directory where wrapper.R is saved). If newdata is provided, predictions of the optimal action and optimal outcome for the new data will be returned. opt is TRUE or FALSE, indicating whether only the optimal action and survival time at each stage will be returned (TRUE) or whether the survival times under each action option at each stage will also be returned (FALSE). mc.cores specifies the number of threads to be used in the calculation; with more threads, the program executes more quickly.
There is a demo data set, dtrdata.csv, in the same GitHub repository, which is used to illustrate the results returned by our wrapper function. Here, the dtrdata set with 1000 observations is split into training (80%) and testing (20%) data.

Figure 1: Mean squared error decomposed into variance and bias² for Scenario 1, in which there is linear dependence of the outcome on covariates at Stages 1 and 2. The left y-axis corresponds to MSE (vertical bars); the secondary y-axis corresponds to stage-wise and overall POT (solid and dashed lines, respectively).

Figure 2: Mean squared error decomposed into variance and bias² for Scenario 2, in which there is nonlinear dependence of the outcome on covariates at Stages 1 and 2. The left y-axis corresponds to MSE (vertical bars); the secondary y-axis corresponds to stage-wise and overall POT (solid and dashed lines, respectively).

Figure 3: The coverage rate (CR) of 95% credible intervals for the log event time mean under the optimal treatment from the proposed method for Scenario 1 (left) and Scenario 2 (right), with the red reference line indicating the nominal 95% level.

Figure 4: Predicted mean difference of log DFS time among patients who received AHCT with NHTL versus standard treatment.
The Stage 2 regression model for y2(a1, a2) is estimated first, using the observed covariates (ō2, a2) and the observed response variable y2. Based on the estimated Stage 2 model, the estimated optimal mapping from Ō2 to A2, denoted simply as d2^opt, can be identified, as well as the corresponding potential payoff at Stage 2, denoted y2(a1, d2^opt). With d2^opt and the potential payoff y2(a1, d2^opt), the response variable for Stage 1 can be constructed as y(a1, d2^opt); this (potential) outcome is composed of the observed Stage 1 payoff y1 and the potential Stage 2 payoff y2(a1, d2^opt).

Table 1: Treatment assigned at Stages 1 and 2.

The Stage 1 data include the optimal Stage 2 treatment identified by AFT-BML, by Q-learning, or consistently optimal under both Stage 2 models. The time points of interest for Stage 1 are one year, two years, and three years. For Stage 2, only the median and third quartile of the observed times are evaluated. The results from both stages are shown in Table 2. For Stage 2, AFT-BML improves the AUC by 0.55% at the median and 1.87% at the third quartile, indicating that AFT-BML has better predictive performance at Stage 2. For Stage 1, the time dependent AUC at 1 year from AFT-BML is

Table 2: Time dependent AUC for Stages 1 and 2 with either the AFT-BML model or the Q-learning model. For Stage 1, observations were censored at entry to Stage 2 for the calculation of the time dependent AUC if they did not receive the optimal treatment in Stage 2 (with the optimal treatment determined using AFT-BML, using Q-learning, or when both agreed).
The data should include all the covariates, including the treatments observed in both stages, an indicator for entering Stage 2 (defined as η_i in Section 2.4), an overall event indicator δ_i regardless of entering Stage 2 or not, a survival time for Stage 1, and a survival time for Stage 2. For those who did not enter Stage 2, since their information is not used in fitting the Stage 2 model, their survival time for Stage 2 can be coded in any reasonable way. In this function, x1 is a vector of covariate names used in fitting the Stage 1 model; a1 is the variable name of the action in Stage 1; time1 is the variable name of the survival time at Stage 1; x2, a2, and time2 are the analogous arguments for Stage 2; stg2ind is the variable name of the indicator of whether an individual entered Stage 2; delta is the variable name of the overall event indicator; and data is the data set.