Predicting Radiotherapy Patient Outcomes with Real-Time Clinical Data Using Mathematical Modelling

Longitudinal tumour volume data from head-and-neck cancer patients show that tumours of comparable pre-treatment size and stage may respond very differently to the same radiotherapy fractionation protocol. Mathematical models are often proposed to predict treatment outcome in this context, and have the potential to guide clinical decision-making and inform personalised fractionation protocols. Hindering effective use of models in this context is the sparsity of clinical measurements juxtaposed with the model complexity required to produce the full range of possible patient responses. In this work, we present a compartment model of tumour volume and tumour composition, which, despite relative simplicity, is capable of producing a wide range of patient responses. We then develop novel statistical methodology and leverage a cohort of existing clinical data to produce a predictive model of both tumour volume progression and the associated level of uncertainty that evolves throughout a patient's course of treatment. To capture inter-patient variability, all model parameters are patient specific, with a bootstrap particle filter-like Bayesian approach developed to model a set of training data as prior knowledge. We validate our approach against a subset of unseen data, and demonstrate both the predictive ability of our trained model and its limitations.

Supplementary Information: The online version contains supplementary material available at 10.1007/s11538-023-01246-0.


Introduction
Radiotherapy remains a mainstay of cancer treatment, with approximately half of all cancer patients receiving radiotherapy as part of their standard of care [1-3]. It is common for a patient's course of treatment to be determined solely by tumour etiology, location, and stage. Other patient-specific factors, such as the intrinsic radiosensitivity and composition of a tumour, are not typically used to inform protocol selection in the clinic [4]. Clinical studies suggest that patients at a similar tumour, node, and metastasis (TNM) stage, and with comparable pre-treatment tumour volumes, may respond differently to the same radiotherapy fractionation schedule [5,6]. Mathematical models have the potential to capitalise on real-time clinical observations to both predict patient-specific responses and guide clinical decision-making. It is hoped that such a tight integration could eventually be used to personalise fractionation schedules either a priori or adaptively during a patient's course of treatment [7].
Challenges associated with the application of mathematical models to interpret data and draw predictions are perhaps most acute for single-patient clinical data. Models must be sufficiently complex to reproduce the full gamut of patient responses [8-10]. However, clinical data are often limited, typically comprising solely noisy measurements of the gross tumour volume (GTV) at sparse time intervals throughout a patient's course of treatment [10,11]. The necessity to start treatment as soon as possible after diagnosis means that pre-treatment predictions are often drawn from only one or two observations. Consequently, models aimed at clinical applications are relatively simple [12,13], incorporate limited biological detail, and often describe only the time evolution of the GTV [6,12,13]. While simplicity can elicit parameter identifiability and avoid overfitting, predictions can be poor, or even misleading, if a model is so simple as to be unable to capture the range of observed (possible) responses. The dangers of overfitting are particularly pronounced for single-patient clinical data used for prediction, where model validation must be assessed pre-treatment; in contrast with experimental data, technical replicates are never available. It is, therefore, crucial to validate models across a wide range of responses, and to accurately quantify uncertainty in predictions used in clinical decision-making [10].
In this work, we present a predictive mathematical modelling framework using clinical GTV data from a previously published cohort of head-and-neck cancer patients who exhibit a variety of treatment responses (Fig. 1) [14,15]. The primary goal of our framework is to integrate previous clinical observations to predict the time course of radiotherapy response in new patients. To demonstrate our framework, we focus our analysis on prediction of the tumour volume progression in four patients presented in Fig. 1 and in our previous work [16]: these patients are excluded from the otherwise randomly selected cohort of patients used to train the mathematical model. All patients in the clinical data set receive a standard radiotherapy fractionation schedule, comprising fractions of 2 Gy delivered on weekdays over a four- to seven-week period [17]. To keep our study as widely applicable as possible, we work with the most fundamental, albeit limited, mode of single-patient data. Computed tomography (CT) scans are routinely used to image tumours pre-treatment at both the diagnosis and treatment planning stages (Fig. 1a) [18-20]. Further scans, such as cone beam CT, may be taken upon the delivery of each fraction but are usually used solely for alignment purposes; for our data, scans were available once per week during treatment. These CT scans are not of a high spatial resolution, are noisy and of low contrast, and do not differentiate heterogeneity in tumour composition. As such, only noisy measurements related to an estimate of the GTV are available at relatively sparse intervals throughout each patient's course of treatment (typically, once per week). The heterogeneity in radiotherapy response exhibited in Fig. 1b-e raises several important questions: in particular, how early into treatment can a practitioner determine if a patient is responsive, and to what extent is it possible to predict the final tumour volume during treatment using only GTV measurements? Given the side effects associated with radiotherapy, and the possible indirect costs of switching treatments at too late a TNM stage, any improvement in prediction accuracy is of great clinical value.
Mathematical models of tumour progression vary significantly in complexity, ranging from simple phenomenological models of GTV, such as logistic and Gompertz growth [12,21-27], to highly detailed spatial models that capture multiple facets of tumour heterogeneity [16,23,27-29]. The limitations and challenges imposed by clinical data yield an overrepresentation of the former, meaning that the functional forms for both growth and radiotherapy response are motivated almost entirely by empirical observations rather than the underlying biological mechanisms. Yet, it is now well established that intra-tumour heterogeneity and the tumour microenvironment play important roles in overall growth, and may significantly influence treatment outcome [16,29-33]. Motivated by these findings, and in consideration of the noisy data available for prediction, we take an intermediate approach and utilise a two-compartment extension of the so-called proliferation-saturation-index (PSI) model of Prokopiou et al. [12] and later Poleszczuk et al. [26]. This choice of ordinary differential equation (ODE) model balances simplicity, through a phenomenological description of radiation-free tumour growth saturation, with biological detail, through a radiotherapy response corresponding to a transfer of cellular material from a living to a dead state. Compared with purely statistical or machine learning models, our mathematical approach allows a fully interpretable integration of clinical data from individuals, whereby the radiotherapy schedule is imported directly from the reported patient fractionation schedule. Finally, our model contains sufficient detail to allow us to quantify the potential utility of expanding clinical data collection to include information relating to tumour composition in addition to GTV.
We take a Bayesian pseudo-hierarchical approach to inference and model calibration, leveraging observed population-level information to draw predictions and quantify corresponding levels of prediction uncertainty. To account for inter-patient heterogeneity, all model parameters are allowed to vary between patients. A schematic of the approach is provided in Fig. 2.

Methods
In this section, we outline the clinical data, and the mathematical and statistical methodology developed and later employed in this work. First, we describe the clinical data and the mathematical model; we then outline the novel statistical methodology employed in the analysis. Finally, in Section 2.5 we outline the procedure for resampling from the joint posterior to produce synthetic patient data. A Julia implementation of the model and inference algorithm, along with the data used in the analysis, is available on GitHub (https://github.com/ap-browning/clinical-inference).

Tumour volume data
Current clinical practice involves two CT scans collected for each patient: one at diagnosis and one at treatment planning. These scans are then used to estimate the GTV [18-20]. While it is feasible to obtain further scans at the time of delivery of each fraction, these scans are often of a low quality, being used primarily to position the patient. As such, they are not typically stored for research purposes.
In this paper, we use retrospective volumetric data, collected weekly, from head-and-neck cancer patients across multiple anatomical locations, including the oropharynx, tonsil, and base of tongue. Patients were immobilised via a thermoplastic mask, with or without a bite block. Isocenter and positioning were verified daily via orthogonal kV or CBCT imaging. Weekly cone beam CT (CBCT) scans were extracted from the record-and-verify system (Mosaiq, Elekta), and suitable CBCTs with minimal artifact were selected for contouring. Each CT scan was segmented by the same radiation oncologist, giving weekly tumour volumes throughout treatment in addition to a volume measurement at the treatment planning stage. The clinical target volume (CTV) was created from the GTV with a 5 mm isotropic expansion, and then trimmed from barriers to spread, including air, bone, fascial planes, and in some cases muscle. The planning target volume (PTV) was created from the CTV via a 3 mm isotropic expansion. An example suite of contoured CT scans from a single patient is shown in Fig. 1a. The GTV data shown in Fig. 1b-e correspond to those presented and discussed in Lewin et al. [16]. In total, GTV data from 51 patients were collected and made available as supplementary material. All methods were carried out in accordance with the institutional policies of the Moffitt Cancer Center. The clinical protocol covering the patient data and methods used in this paper was approved by the Moffitt Cancer Center's Institutional Review Board (IRB). Since this is a retrospective study using de-identified data of adult human subjects, informed consent was waived by the IRB.
Mathematical model

Given the limitations imposed by GTV clinical data, we present a relatively simple mathematical model that is able to capture the four classes of tumour response observed in Fig. 1.
In particular, we extend the PSI model [26] to track both the living tumour volume, L(t), and the dead (necrotic) tumour volume, N(t), where the GTV is given by V(t) = L(t) + N(t). Living material proliferates logistically at rate λ [d⁻¹] towards a carrying capacity K, and becomes necrotic at rate η [d⁻¹], such that

  dL/dt = λL(1 − (L + N)/K) − ηL − γL Σᵢ δ(t − tᵢ),
  dN/dt = ηL − ζN + γL Σᵢ δ(t − tᵢ),

where δ(t − tᵢ) is a delta function, representing a transfer of a volume γL from the living compartment to the dead compartment at the time tᵢ of each radiotherapy fraction, such that γ quantifies the strength of radiotherapy response. We assume further that necrotic material is degraded at a constant rate ζ [d⁻¹]. To capture inter-patient heterogeneity, all parameters are allowed to vary between patients [47].
The data suggest that the initial GTV is comparable between responsive and poorly responsive patients (Table 1). Therefore, we normalise L(t) and N(t) with the initial GTV, such that V(0) = 1, and describe the initial tumour composition as

  L(0) = 1 − ϕ₀,  N(0) = ϕ₀,

where 0 ≤ ϕ₀ ≤ 1 is an unknown, patient-specific parameter to be estimated that represents the proportion of the tumour occupied by dead material at t = 0. We note further that the carrying capacity parameter K is interpreted with respect to the measured initial GTV. Thus, GTV measurements presented throughout the paper may be interpreted as the fold change (FC) compared to the initial GTV. The interpretation of all other parameters remains unchanged by this choice of units.
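To illustrate how the model can be simulated, the following sketch integrates the two compartments between fractions and applies the instantaneous transfer of γL at each dose time. This is an illustrative Python sketch only (the published implementation accompanying this paper is in Julia), and the parameter values and schedule below are hypothetical, chosen solely to demonstrate usage.

```python
import numpy as np
from scipy.integrate import solve_ivp

def simulate(params, dose_times, t_end=56.0):
    """Simulate the two-compartment model: logistic growth of living
    volume L, necrosis at rate eta, necrotic decay at rate zeta, and an
    instantaneous transfer of gamma*L from living to dead at each dose."""
    lam, K, gamma, zeta, eta, phi0 = params

    def rhs(t, y):
        L, N = y
        dL = lam * L * (1.0 - (L + N) / K) - eta * L
        dN = eta * L - zeta * N
        return [dL, dN]

    y = np.array([1.0 - phi0, phi0])   # normalised so that V(0) = 1
    times = np.concatenate([[0.0], np.sort(dose_times), [t_end]])
    ts, Vs = [0.0], [y.sum()]
    for t0, t1 in zip(times[:-1], times[1:]):
        if t1 > t0:                    # integrate between dose times
            sol = solve_ivp(rhs, (t0, t1), y, rtol=1e-8, atol=1e-10)
            y = sol.y[:, -1]
        if t1 in dose_times:           # apply dose: move gamma*L to dead
            transfer = gamma * y[0]
            y = y + np.array([-transfer, transfer])
        ts.append(t1)
        Vs.append(y.sum())
    return np.array(ts), np.array(Vs)

# Hypothetical parameters (lam, K, gamma, zeta, eta, phi0) and a standard
# schedule: weekday fractions over six weeks, starting at t = 14 d.
doses = np.array([14.0 + 7 * w + d for w in range(6) for d in range(5)])
t, V = simulate((0.05, 2.0, 0.05, 0.1, 0.02, 0.2), doses)
```

Because doses are modelled as instantaneous transfers, the ODE system is integrated piecewise between fraction times rather than including the delta functions in the right-hand side directly.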
In the supplementary material (Figs. S1 and S2), we perform a parameter sweep across the parameters relating to necrosis and necrotic material decay (η and ζ, respectively), for a patient subject to daily doses of radiotherapy on weekdays over a six-week period, to verify that the model is able to reproduce the wide range of dynamics observed in the clinical data. While the parameter sweep is not exhaustive, the results demonstrate that varying only these two parameters is sufficient to produce the range of responses observed in Fig. 1.

Classifying responses
We observe four classes of qualitative response within the clinical data, as highlighted in Fig. 1 and summarised in Table 1. In Fig. 1b, the patient responds well to radiotherapy, with the tumour decreasing markedly in volume throughout treatment. Hereafter, we refer to a patient exhibiting this type of behaviour as a fast responder. By contrast, there are patients for whom the effects of radiotherapy appear to be marginal when viewed in terms of tumour volume over time alone, as is the case in Fig. 1c. We classify these patients as poor responders. In a number of cases, the initial response of the tumour to radiotherapy appears to be favourable, but the response plateaus in the latter stages of treatment, resulting in a non-negligible final tumour volume (Fig. 1d); this radiographic volume may, however, subsequently recede in the weeks after radiotherapy. Such patients are classified as having a plateaued response. Occasionally, as in Fig. 1e, a patient may appear to exhibit continued tumour progression throughout the first few weeks of radiotherapy before showing a delayed response, characterised by a decrease in tumour volume towards the end of treatment. We characterise this type of response as pseudo-progression.
We classify a model realisation into one of four classes of response based on a standard patient receiving doses on weekdays over a six-week period, with CT measurements taken at the start of each treatment week and at the time of the final dose (the pre-treatment volume measurement is not used to classify patients). Classifications are based on the set of noise-free synthetic measurements generated by the model. The specific thresholds chosen in the classification algorithm yield excellent results that reliably distinguish between each class (Fig. S4). However, the relatively small number of plateaued responders and pseudo-progressors in the training set (Table 1) suggests that the criteria will need to be reassessed should more data become available.

Table 1. Prior classification of each patient response class, based on the full posterior, p(θ|{Dᵢ}ⁿᵢ₌₁), and the second-level prior, p₂(θ), the latter corresponding to an expanded kernel density estimate constructed from samples of the full posterior. The statistics relating to the initial volume are based on the classifications of the prior samples corresponding to each patient in the training set; non-integer counts arise due to probabilistic classification of patients. An approximate statistical test, based on Welch's approximate unequal-variance t-test [48], indicates no statistically significant difference between fast and poor responders (P = 0.582), nor between responders (*) and poor responders (P = 0.557). Asterisks indicate classifications corresponding to patients who show an eventual response.

Statistical model
We take a standard approach and assume that CT scan data are independent and normally distributed about the model prediction [49], such that

  V_obs(tⱼ) ~ Normal(V(tⱼ; θ), σ²(tⱼ)),  σ(tⱼ) = α₁ + α₂ V(tⱼ; θ),

where the standard deviation is assumed to be a linear function of the model prediction, such that the statistical model captures both additive and multiplicative normal noise: α₁ represents an absolute contribution to the noise, and α₂ a relative contribution.
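The corresponding log-likelihood can be sketched as follows. This is an illustrative Python sketch (the accompanying implementation is in Julia); the function name and signature are ours.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(v_obs, v_model, alpha1, alpha2):
    """Log-likelihood of GTV observations under independent normal noise
    with standard deviation linear in the model prediction:
    sigma_j = alpha1 + alpha2 * V(t_j; theta)."""
    v_model = np.asarray(v_model, dtype=float)
    sigma = alpha1 + alpha2 * v_model
    return float(np.sum(norm.logpdf(np.asarray(v_obs, dtype=float),
                                    loc=v_model, scale=sigma)))
```

Setting α₂ = 0 recovers purely additive noise, while α₁ = 0 recovers purely multiplicative noise.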
While the dynamical parameters are assumed to vary between patients, we assume that the noise parameters remain fixed. Therefore, we pre-estimate the noise parameters α₁ and α₂ by first inferring them alongside the dynamical parameters for each patient. We then pool an equal number of noise parameter posterior samples for each patient and approximate (α₁, α₂) as the marginal posterior mode. We are motivated to take this relatively standard approach of pre-estimating the noise parameters to reduce both the dimensionality of the parameter space and the complexity of the statistical methodology.

Bayesian inference
An important difference between clinical and experimental data relates to the sample size: in clinical studies, each patient undergoes therapy only once. Given that patients are highly heterogeneous and data are relatively limited (Fig. 1), this poses a significant statistical challenge for computational inference. To account for this, we take a pseudo-hierarchical approach to inference and prediction by first training the model on a subset of the data (the training set).
We are motivated to develop this novel approach to inference, as opposed to a more standard Bayesian hierarchical approach, as there is no sensible means by which to propose a particular distributional form for the joint parameter distribution at the population level: given the distinct classes of response observed in Fig. 1, for example, we expect the joint parameter distribution to be multimodal. The correlation structure between model parameters is also unclear.
From a full cohort of 51 patients, we randomly select a group of 40 patients to act as the training set; these patients represent those that have been observed throughout an entire course of treatment, prior to the present. For each patient in the training set, we assume that initial knowledge about the model parameters is encoded in a "first-level prior", p₁(θ), where θ = (log λ, log K, log γ, log ζ, log η, log ϕ₀) (Fig. 2). We then update our knowledge about the parameters pertaining to patient i using Bayes' theorem, such that

  p(θ|Dᵢ) ∝ p(Dᵢ|θ) p₁(θ),

where Dᵢ represents the data (including both volume measurements and the radiotherapy schedule) for patient i. We choose p₁(θ) to be an independent multivariate uniform distribution (see Table 2 and Fig. 3), an uninformative choice.
The posterior for patient i can be interpreted as the full posterior, conditioned on knowledge that the parameters relate to patient i,

  p(θ|Dᵢ) = p(θ|{Dⱼ}ⁿⱼ₌₁, i).    (6)

The full posterior can be obtained by marginalising over all patients in the training set, and is given by

  p(θ|{Dᵢ}ⁿᵢ₌₁) = Σⁿᵢ₌₁ wᵢ p(θ|Dᵢ),    (7)

where wᵢ = P(i) represents the prior probability (i.e., weighting) of patient i. The result in Eq. (7) follows immediately from Eq. (6) by the law of total probability. For simplicity, we set wᵢ = const; however, such weights may be allowed to differ if additional knowledge informs patient similarity, for example, based on characteristics known to affect radiotherapy response, such as the clinical stage or age of a patient [50]. Another way to interpret the full posterior is as a uniform mixture of the individual-level posterior distributions. We then denote the full posterior as the "second-level prior", p₂(θ), which represents our knowledge about the parameters when analysing new patients (we drop notational dependence on already observed data for convenience) (Fig. 2). An interpretation of our procedure is to identify the similarity between a new patient and the observed treatment outcomes for patients in the training set, and to combine the knowledge obtained from past patients when predicting outcomes for the new patient. In Fig. 3 we compare the first-level prior p₁(θ) to the full posterior, and in Fig. 4 we show pairwise marginal distributions of samples from the full posterior, Eq. (7).
Given a possibly temporally incomplete set of measurements from a new patient, D_new, the posterior distribution of the parameters is again given by

  p_new(θ|D_new) ∝ p(D_new|θ) p₂(θ).

A simple technique to obtain a set of weighted samples from p_new(θ|D_new) is to apply a bootstrap particle filter to pre-obtained samples from p₂(θ). Since patients in the training set are weighted equally, these may comprise a concatenation of samples from each individual-level posterior (we obtain these using an adaptive MCMC algorithm [51]; diagnostic statistics and convergence plots are given as supplementary material). An advantage of the bootstrap particle filter approach is that it requires minimal computational effort to update the posterior for new patients. The primary limitation introduced by this choice is that we cannot distinguish between parameters that vary between patients and those that are fixed: hence, we pre-estimate and fix the noise parameters in this work.
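The bootstrap particle filter update described above can be sketched as follows: samples from p₂(θ) are weighted by the new patient's likelihood and resampled with replacement. This is an illustrative Python sketch (the accompanying implementation is in Julia); `log_lik` stands in for the log-likelihood of D_new under the model and statistical model described above.

```python
import numpy as np

def particle_update(prior_samples, log_lik, rng=None):
    """One bootstrap-particle-filter step: weight samples from the
    second-level prior p2(theta) by the new patient's likelihood, then
    resample with replacement according to the normalised weights.

    prior_samples: (m, d) array of parameter samples from p2(theta).
    log_lik: callable mapping a parameter vector to log p(D_new | theta).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    logw = np.array([log_lik(theta) for theta in prior_samples])
    logw -= logw.max()                  # guard against underflow
    w = np.exp(logw)
    w /= w.sum()
    idx = rng.choice(len(prior_samples), size=len(prior_samples), p=w)
    return prior_samples[idx]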
In practice, this approach may be problematic, since patients in the training set are unlikely to be identically representative of new patients, particularly for small training sets (in our case, n = 40). In the bootstrap particle filter, this would lead to a small number of heavily weighted particles (that may or may not produce model realisations similar to the new patient data). We address this potential issue by forming p₂(θ) by resampling perturbed particles from p(θ|{Dᵢ}ⁿᵢ₌₁) using a multivariate normal distribution with covariance matrix, denoted Σ_ε, constructed by expanding the covariance matrix of Silverman's rule for kernel density estimation,

  Σ_ε = β (4 / (m(d + 2)))^(2/(d+4)) Σ_θ,

where β is an expansion factor (we choose β = 2), d is the dimension of θ, m is the number of samples of θ|{Dᵢ}ⁿᵢ₌₁, and Σ_θ is the covariance matrix of the samples. We reject samples outside the support of the first-level prior p₁(θ) (see Table 2), in effect constructing p₂(θ) as a kernel density estimate with truncated multivariate normal kernels. This approach is also similar to a one-step sequential Monte Carlo algorithm [52].
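The perturbed resampling scheme can be sketched as follows. This is an illustrative Python sketch (the accompanying implementation is in Julia): the placement of the expansion factor β as a simple multiplier on the Silverman covariance is our assumption, and the prior support is taken to be a box, consistent with the independent uniform first-level prior.

```python
import numpy as np

def expanded_resample(samples, lower, upper, beta=2.0, n=1000, rng=None):
    """Form samples from p2(theta) by resampling full-posterior particles
    and perturbing them with a multivariate normal kernel. The kernel
    covariance follows Silverman's rule, expanded by the factor beta;
    proposals outside the box support of the first-level prior are
    rejected, giving truncated multivariate normal kernels."""
    rng = np.random.default_rng(0) if rng is None else rng
    m, d = samples.shape
    h2 = (4.0 / (m * (d + 2))) ** (2.0 / (d + 4))   # Silverman's rule
    cov = beta * h2 * np.cov(samples, rowvar=False)
    out = []
    while len(out) < n:
        base = samples[rng.integers(m)]             # resample a particle
        prop = rng.multivariate_normal(base, cov)   # perturb
        if np.all(prop >= lower) and np.all(prop <= upper):
            out.append(prop)                        # accept if in prior box
    return np.array(out)
```

Rejection sampling against the prior box is simple but can become slow if much of the kernel mass lies outside the support; for the wide uniform priors used here this is rarely a concern.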

Quantifying goodness-of-fit
We quantify goodness-of-fit using the so-called Bayesian R² statistic [53], defined for a single posterior sample by

  R² = Var(V_fit) / (Var(V_fit) + Var(V_fit − V_obs)),

where V_fit denotes the set of fitted values, and V_obs denotes the set of observed values. A given posterior distribution yields a distribution of R² statistics: in this work, we report the median of the resultant distribution. Similarly to the frequentist R² statistic, a Bayesian R² statistic of unity indicates that the model captures all data variability (i.e., the variance of the residuals, Var(V_fit − V_obs), is zero), while a Bayesian R² statistic of zero indicates that all fitted values lie on a horizontal line (hence, we expect low R² statistics for poor responders).
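For a single posterior sample, the statistic can be computed as follows (an illustrative Python sketch, consistent with the properties described above; in practice it is evaluated for each posterior sample and the median reported):

```python
import numpy as np

def bayes_r2(v_fit, v_obs):
    """Bayesian R^2 for a single posterior sample: the variance of the
    fitted values divided by the variance of the fitted values plus the
    variance of the residuals."""
    v_fit = np.asarray(v_fit, dtype=float)
    resid = np.asarray(v_obs, dtype=float) - v_fit
    return float(np.var(v_fit) / (np.var(v_fit) + np.var(resid)))
```

A perfect fit gives R² = 1 (zero residual variance), while constant fitted values give R² = 0, as noted above for poor responders.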

Generation of synthetic patient data
We generate synthetic patient data by resampling parameters from the full posterior and exposing the resulting synthetic patients to a standard fractionation schedule. The proportion of samples corresponding to each class is shown in Table 1.
First, it is evident from the results in Fig. 4 that the predicted value of the initial necrotic proportion, ϕ₀, does not vary between fast and poor responders. This is seen in the bivariate densities between ϕ₀ and all other parameters. The statistic does, however, appear to distinguish pseudo-progressors from the other response types: estimates for ϕ₀ suggest that tumours in such patients contain a much larger necrotic region pre-treatment. Fast responders are characterised, in relation to poor responders, by both a higher radiotherapy sensitivity, γ, and a higher necrotic material decay rate, ζ. The necrotic material decay rate also appears to distinguish poor, fast, and plateaued responders: poor responders through a very low decay rate, plateaued responders through a high decay rate, and fast responders through an intermediate rate. Finally, the results in Fig. 4 suggest that pseudo-progressors are characterised by both a high cell proliferation rate and a correspondingly high radiotherapy response.

Model predictions
Given that the training set is relatively small, a potential obstacle is that the responses of new patients may not be similar enough to those of existing patients to produce reliable predictions; indeed, only 6.0% of posterior samples correspond to patients that exhibit a poor response to treatment. To address this with the existing data, we "expand" the full posterior to form the second-level prior, p₂(θ), by resampling and perturbing (essentially, forming p₂(θ) as a multivariate kernel density estimate based on the full posterior, with a kernel variance expanded from Silverman's rule to account for new patient dissimilarity). The updated proportions, based on 100,000 samples from p₂(θ), are given in Table 1, and suggest an updated prior probability of a new patient exhibiting a response of 65.3%. An alternative approach, beyond the scope of the current work, would be to stratify perturbed full posterior samples based on an external and accepted classification ratio: for example, to choose the prior weights {wᵢ} to achieve a desired prior ratio of patients in each classification. These results highlight the difficulty of classifying patient outcomes based on a relatively small cohort of patients with little prior parameter knowledge. Using the first-level prior (i.e., excluding all knowledge gained through analysis of the training data) further reduces the prior probability of an eventual response to 44.7%.
We first assess the predictive ability of our trained model by generating data from four synthetic patients exhibiting a fast response (Fig. 5a); a poor response (Fig. 5b); a plateaued response (Fig. 5c); and pseudo-progression (Fig. 5d). Given that each set of patient-specific parameters is resampled from the full posterior, we expect each synthetic patient to display a similar response to at least one patient in the training set. Additionally, as each set of synthetic data is generated by the mathematical model, we are guaranteed that the observed response is within the possible gamut of model responses. We provide a table summarising the parameter values used for each patient in the supplementary material (Table S1).
In Fig. 5, we simulate real-time predictions by calibrating and forming predictions each week throughout treatment (i.e., at the time of each weekly CT scan). We show predictions made at the start of treatment (t = 14 d), and every second week following (t = 28 d, 42 d, and 56 d).
The results shown for t = 56 d correspond to retrospective analysis of the trajectory, after all measurements have been taken, while the predictions drawn at t = 14 d are made pre-treatment, before any radiotherapy response has been observed. Since pseudo-progressors are under-represented in the data set, and hence in the prior, predictions made for the pseudo-progressor at t = 28 d almost entirely miss the true trajectory: the single data point at t = 28 d that sees a decrease is judged alongside both prior knowledge and potential measurement noise.
To quantitatively compare the time evolution of prediction confidence, we plot in Fig. 6a-d the evolution of posterior information relating to the radiotherapy response, γ, and in Fig. 6e-h the time evolution of the predicted final tumour volume (i.e., the fold-change GTV at t = 56 d compared to the measurement at t = 0 d). The most immediate result is that both the fast responders and pseudo-progressors yield a posterior density for γ higher than that for the poor responders. The results in Fig. 6e show that the predicted final GTV quickly narrows around the true value for the fast responder, but takes longer for the plateaued responders and pseudo-progressors. At the same time, the results in Fig. 6f show that, by two weeks into treatment, the model predicts with 95% confidence that the patient will not see a final GTV less than 50% of that pre-treatment. The results in Fig. 6g highlight again the difficulties faced when drawing predictions for patients exhibiting relatively rare responses: working with synthetic data eliminates the question of model misspecification, however the 95% credible intervals produced from predictions drawn at t = 21 d and t = 28 d do not cover the true value (which can be calculated by resimulating data from each synthetic patient without measurement noise).
Given GTV alone, it is not until t = 42 d (four weeks into treatment) that the model predicts with 95% confidence that the patient's tumour will eventually see a reduction in volume. This is in line with previous reports that mid-treatment responses correlate with outcome [15].
Figure 6. Predictions for four synthetic patients. For the four patients analysed in Fig. 5, we show (a-d) the evolution of the posterior distribution relating to radiotherapy response, γ; and (e-h) the evolution of predictions for the relative tumour volume at the conclusion of treatment. In all cases, data up to, and including, the relevant time are included in the prediction. In (e-h), we show the mean (black disc), and both 50% and 95% credible intervals for the final tumour volume, together with the true final tumour volume (red dashed), both given as the fold change (FC) relative to the initial volume, V(0). The true values of γ used for each patient are given as supplementary material (Table S1).

Clinical data
Now that we have validated the model's ability to predict the time course of GTV for synthetic patients with a variety of radiotherapy responses, we turn to focus on drawing real-time predictions from unseen clinical data.
In Figs. 7 and 8, we repeat the analysis performed in Figs. 5 and 6 using the unseen clinical data. For two of the four patients, the model predicts early in treatment, with 95% confidence, that the patient will eventually achieve an overall reduction in tumour volume. Indeed, for both of these patients the precision in predictions of the final tumour volume narrows quickly around what is eventually observed. In contrast, at day 28 the patient that eventually exhibits a poor response sees roughly half of all predicted trajectories indicating an eventual increase in volume, and half a decrease. Throughout treatment, the mean prediction remains around the eventually observed value of unity. The results for the pseudo-progressor mirror those observed in the synthetic data: the predictions are perhaps initially misleading due to the relatively small (2.3%) prior probability of a patient exhibiting such a response.
To quantitatively explore the model's ability to predict patient classification, in Fig. 9 we plot the posterior classification probabilities for predictions drawn at each time point, in addition to a pooled classification probability of a patient displaying a response (i.e., not being a poor responder). Initially, at t = 0 d, the classification probabilities represent those in the second-level prior, p₂(θ) (Table 1). The most notable results are for the relatively rare classifications of plateaued response and pseudo-progression. In the case of the former, the patient has a posterior classification mode (i.e., the most likely classification given all the information collected during the patient's course of treatment) of a fast responder. This again highlights the difficulty of distinguishing plateaued responses from observation noise seen in fast responders.
The pseudo-progressor, however, only begins to gain a correct posterior classification probability by t = 42 d, just over four weeks into treatment. The classifications following the first measurement at t = 14 d are qualitatively similar to those observed in the prior; subsequent measurements, which show an increase in gross tumour volume, lead to classification as a poor responder, highlighting the limitations of the currently trained model in distinguishing pseudo-progressors from poor responders.
To explore the relative value of existing and newly collected information, in the supplementary material we produce additional results that show temporal predictions for both synthetic and validation patients, produced using the uninformative (i.e., first-level) prior.These results correspond to a prior probability of 44.7% that a patient will eventually respond to treatment; much lower than that estimated from analysis of the training data (94.0%) and that in the second-level prior (65.3%).For the synthetic patients presented in Fig. 5, the results show a decrease in prediction fit (as measured by Bayesian R 2 ) for predictions drawn prior to t = 28 d.
For times later than t = 42 d, predictions drawn using the uninformative and informative priors are comparable. Similar results are seen for the validation patients presented in Fig. 7, although the differences are less pronounced from the third post-radiotherapy observation point onwards. The difference between results for the synthetic and validation patients is expected: the informative prior is known to be representative of the synthetic patients, whereas we do not have this guarantee for the validation patients. Hence, observed information is more important than prior information for newly informed patients that are not well represented by the prior.
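Prediction fit above is quantified by Bayesian R². The excerpt does not reproduce its definition, so the sketch below uses the common posterior-draw-wise definition of Gelman et al. (2019), which may differ in detail from the paper's: for each posterior draw, R² is the variance of the fitted values divided by the sum of that variance and the residual variance.

```python
import numpy as np

def bayesian_r2(pred, obs):
    """Bayesian R-squared, one value per posterior draw.

    pred : (n_draws, n_obs) posterior-predictive means at the observation times
    obs  : (n_obs,) observed values
    Returns an (n_draws,) array in [0, 1]; summarise by, e.g., its median."""
    pred = np.asarray(pred, dtype=float)
    res = np.asarray(obs, dtype=float) - pred
    var_fit = pred.var(axis=1, ddof=1)   # variance of fitted values per draw
    var_res = res.var(axis=1, ddof=1)    # variance of residuals per draw
    return var_fit / (var_fit + var_res)
```

A perfect fit gives R² = 1 for every draw; noisier predictive draws pull the distribution toward zero.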

Value in collecting measurements of tumour heterogeneity
The weekly GTV measurements used in our analysis already exceed clinical practice of just two pre-treatment CT scans per patient. To assess the potential value of collecting higher-quality scan data that additionally enable identification of the tumour's necrotic volume, we repeat our analysis of the synthetic patient in Fig. 5a assuming that noisy measurements of both V(t) and N(t) are available. The results in Fig. 10a,b show that, by day 28, relatively precise predictions of the trajectories of both variables can now be made. In Fig. 10c we quantitatively compare predictions for the final GTV in both scenarios. As expected, more precise estimates can be made when data relating to both variables are available.
In Fig. 11, we repeat the analysis for two new synthetic patients that experience a poor response. For the first patient, a small gain in predictive ability is seen from the inclusion of necrotic volume measurements (Fig. 11c); interestingly, this improvement is not seen for the second patient (Fig. 11f). Overall, these results highlight a key challenge in using the population-calibrated mathematical model to draw predictions relating to tumour composition and the underlying cause of a poor response, particularly given the wide-ranging spatial compositions seen in poor responders. The first synthetic patient exhibits a poor response due to the development of a tumour comprising almost entirely necrotic material, which does not degrade (Fig. 11b), while the tumour composition of the second synthetic patient is perhaps more realistic, with the necrotic fraction comprising approximately 60% of the GTV at the end of treatment (Fig. 11e). Since the model is not trained using clinical data relating to tumour composition, it cannot distinguish between tumour compositions that are clinically realistic and those that are not. This is not an issue for prediction of the GTV, as prediction uncertainty incorporates all possible tumour compositions through prior knowledge. Predictions of the necrotic volume, meanwhile, predominantly represent prior knowledge, in addition to restrictions imposed by the modelled relationship between the observed GTV of patients in the training set and their potential inner tumour composition.

Figure 9. Classification of the four patients excluded from the training set. We predict each patient's classified response using data up to and including the relevant time (the height of each region indicates the predicted proportion). The predicted probability of the patient responding (i.e., receiving a classification that is not that of a poor responder) is shown as a black dashed line. Before the start of treatment, the predicted classifications correspond to those of the second-level prior in Table 2.

Conclusion
The development of predictive mathematical models of patient-specific tumour response is hindered by multiple challenges. Mathematical models must incorporate sufficient detail to capture a wide range of potential responses, while clinical data are highly limited, often comprising just one or two noisy measurements of tumour volume prior to treatment initiation. Advances in imaging technologies, or the use of magnetic resonance imaging embedded in radiation delivery devices, may in future provide a cost-effective means of collecting more detailed information, allowing the calibration of correspondingly more detailed mathematical models [54][55][56][57]. In this work, however, we work with a fundamental set of measurements, and present the statistical methodology and an appropriately complex mathematical model needed to maximise data utility and draw clinically relevant predictions by leveraging a cohort of patients that exhibit a variety of treatment responses.
Importantly, the two-compartment model is able to reproduce the full range of patient responses observed in our cohort of clinical data, representing an improvement over previously proposed one-compartment models, which may not capture more complex behaviours such as the plateaued response and pseudo-progression. This is particularly important for prediction, since the choice of model and the gamut of possible responses form a significant part of prior knowledge. While the mathematical literature presents an extensive catalogue of more complex models, we find that our chosen model, with six unknown parameters that each have a direct biophysical interpretation, is simultaneously sufficiently simple to ensure practical identifiability in some cases, and sufficiently complex to produce the variety of responses seen in the clinical data. Parameter identifiability is clearly not essential to produce predictions (single-patient predictions drawn early in the course of treatment from the first-level prior, where the number of parameters exceeds the number of data points, are still sensible); however, the relatively small parameter space and the resultant tightly constrained second-level prior (Fig. 4) ensure adequate coverage in our resampling-based inference method: we expect our approach to become prohibitively expensive for models with large numbers of parameters.
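For concreteness, the sketch below forward-simulates one plausible two-compartment structure of this kind. It is an illustrative assumption consistent with the parameter descriptions in Fig. 3 (cell proliferation rate λ, carrying capacity K, radiotherapy response strength γ, necrotic decay rate ζ, necrosis rate η, and initial necrotic fraction ϕ0), and not the paper's exact formulation: living volume grows logistically, cells become necrotic at rate η, necrotic debris clears at rate ζ, and each fraction moves a proportion 1 − exp(−γ) of living cells into the necrotic compartment.

```python
import numpy as np

def simulate(params, t_end=70.0, dt=0.01, rt_days=None):
    """Forward-simulate an illustrative two-compartment tumour model.

    State: T (living volume) and N (necrotic volume); the GTV is V = T + N,
    normalised so that V(0) = 1 (fold change). The equations below are an
    assumed form for illustration, not the paper's exact model."""
    lam, K, gam, zeta, eta, phi0 = params
    rt_days = set() if rt_days is None else set(rt_days)
    steps_per_day = int(round(1.0 / dt))
    T, N = 1.0 - phi0, phi0
    times = np.arange(0.0, t_end + dt, dt)
    V = np.empty_like(times)
    for i, t in enumerate(times):
        V[i] = T + N
        # each fraction instantaneously moves a proportion 1 - exp(-gam)
        # of living cells into the necrotic compartment
        if i % steps_per_day == 0 and int(round(t)) in rt_days:
            killed = T * (1.0 - np.exp(-gam))
            T -= killed
            N += killed
        # logistic growth of living cells, necrosis at rate eta, and
        # clearance of necrotic debris at rate zeta (forward Euler step)
        dT = lam * T * (1.0 - (T + N) / K) - eta * T
        dN = eta * T - zeta * N
        T = max(T + dt * dT, 0.0)
        N = max(N + dt * dN, 0.0)
    return times, V
```

Under this assumed form, a plateaued response can arise when radiotherapy kills cells efficiently but the resulting necrotic debris clears slowly (small ζ), so the GTV stalls well above zero.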
The overarching goal of the presented framework is to leverage existing clinical data to produce a predictive model for the GTV that accurately captures the uncertainty in predictions made for new patients. By benchmarking against both synthetic data and a validation clinical data set, we show that our approach excels at this goal for patients with the more typical responses: fast and poor responders. Given the relatively small size of our training data, comprising measurements from 40 patients, it is no surprise that our approach does not perform as well for patients with atypical responses: pseudo-progressors, for instance, make up only 2.3% of the prior, meaning that the GTV progression of these patients is informed by (on average) a single patient in the training set. In this case, it takes six on-treatment measurements before the patient is identified as more likely to exhibit an eventual response than a poor response.
The most effective remedy would be to accumulate significantly more clinical data with better representation of outliers. Should enough data become available, stratification could be used to ensure that the representation of patients in the training data either concords with that in the population, or incorporates non-quantitative prior knowledge (such as patient characteristics) that pre-informs similarities with patients in the training set. Our modelling framework is well poised to incorporate more detailed clinical data. We highlight that, in general, our statistical methodology is entirely model agnostic. Thus, informed by more detailed data, our approach could be used to develop a fully validated predictive model of not just the GTV, but tumour composition, cell density, proliferation, hypoxia, and more. However, this proposition is not without limitation: our current choice to bootstrap parameter samples is likely to perform poorly for models with a large number of parameters. Such dimensionality-induced issues can be partly alleviated by sampling the full posterior directly, although this would introduce additional computational challenges. Further statistical developments are also needed to include parameters that are fixed between patients (for example, the noise parameters), or parameters that are assumed to be uncorrelated with others.
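The bootstrap of parameter samples referred to above can be caricatured as a single weighted-resampling step. The sketch below is schematic rather than the paper's implementation: it assumes a Gaussian observation model with a known noise scale, takes pooled parameter samples representing the second-level prior, reweights them by the likelihood of a new patient's observations, and bootstrap-resamples to return an equally weighted approximate posterior. All function and argument names are our own.

```python
import numpy as np

def update_particles(theta, sim_gtv, new_obs, obs_times, sigma, rng):
    """One particle-filter-style update for a new patient.

    theta     : (n, p) parameter samples (e.g. the second-level prior)
    sim_gtv   : function (theta_i, obs_times) -> predicted GTV values
    new_obs   : observed GTV values for the new patient
    sigma     : assumed Gaussian observation noise s.d. (illustrative)
    rng       : numpy Generator used for the bootstrap resample
    Returns an (n, p) array of resampled, equally weighted samples."""
    n = theta.shape[0]
    logw = np.empty(n)
    for i in range(n):
        pred = sim_gtv(theta[i], obs_times)
        # Gaussian log-likelihood up to an additive constant
        logw[i] = -0.5 * np.sum(((new_obs - pred) / sigma) ** 2)
    w = np.exp(logw - logw.max())       # stabilise before normalising
    w /= w.sum()
    idx = rng.choice(n, size=n, replace=True, p=w)
    return theta[idx]
```

Because samples are only reweighted and resampled, never perturbed, coverage of the new patient depends entirely on the prior pool, which is why a tightly constrained second-level prior matters and why the scheme degrades in high-dimensional parameter spaces.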
Our results add to a growing body of work [60][61][62][63] that highlights the utility that mathematical models could bring to the clinic; in future, such models, informed by highly detailed and representative patient data, could provide objective, real-time, and personalised patient predictions that inform clinical decision-making.

Supplementary material for
"Predicting radiotherapy patient outcomes with real-time clinical data using mathematical modelling"
Alexander P Browning
* These authors contributed equally. † These authors also contributed equally. ‡ Corresponding author: browning@maths.ox.ac.uk or HEnderling@mdanderson.org
arXiv:2201.02101v3 [q-bio.TO] 13 Dec 2023

Figure S8. Temporal predictions for four patients excluded from the training set, using the uninformative prior. We reproduce the analysis from Fig. 7 in the main document using the uninformative (i.e., population-level) prior.

Figure 1. Gross tumour volume (GTV) measurements taken during radiotherapy. (a) Example CT scan of an oropharyngeal cancer patient throughout treatment, showing the tumour contoured in blue (data courtesy of CD Fuller and MD Anderson). (b-e) Clinical data representing four qualitatively different radiotherapy results. Predictions from the mathematical model, along with a 95% credible interval for the modelled observation means, are shown in purple. In all cases, the radiotherapy schedule starts at the time of the second observation. The four patients shown are excluded from the training data analysed in later parts of our study. The units of GTV are given as the fold change (FC) relative to the initial volume.
Fig. 1b-e as they undergo their course of treatment.

(a) Pseudo-progressor. A second (noise-free) measurement greater than 102% of the first following radiotherapy onset. (b) Plateaued response. Not a pseudo-progressor, with a final measurement greater than 20% of the initial, and with a final rate-of-change less than 10% of the maximum rate-of-change observed. (c) Fast responder. Not in any other classification.
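The criteria above can be expressed as a small classifier. The sketch below implements only the three rules stated in this caption, applied to a noise-free trajectory of fold-change GTV values; the poor-responder rule from the full paper is not reproduced in this excerpt, so trajectories failing rules (a) and (b) simply fall through to "fast responder" here. Our reading of rule (a), that the second measurement following radiotherapy onset is compared against the first post-onset measurement, is an assumption.

```python
import numpy as np

def classify(volumes, times, rt_start):
    """Classify a noise-free GTV trajectory (fold change relative to the
    initial volume), sampled at the measurement times. Implements only the
    three rules stated in the caption; the full poor-responder rule is not
    available in this excerpt."""
    v = np.asarray(volumes, dtype=float)
    t = np.asarray(times, dtype=float)
    # (a) pseudo-progressor: second measurement following radiotherapy
    #     onset exceeds 102% of the first (assumed reading of the rule)
    post = v[t >= rt_start]
    if len(post) >= 2 and post[1] > 1.02 * post[0]:
        return "pseudo-progressor"
    # (b) plateaued response: final volume > 20% of the initial volume and
    #     final rate-of-change < 10% of the maximum rate-of-change observed
    rates = np.abs(np.diff(v) / np.diff(t))
    if v[-1] > 0.2 * v[0] and rates[-1] < 0.1 * rates.max():
        return "plateaued response"
    # (c) fast responder: not in any other classification
    return "fast responder"
```

Applying this classifier to posterior model realisations, rather than to data, is what yields the classification probabilities plotted in Fig. 9.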

Figure 3. Parameter posteriors from analysis of the training data. First-level prior distribution (blue) and full posterior (purple) following analysis of the training data. The first-level prior, p1(θ), comprises independent uniform distributions in the log of each unknown parameter. Parameters relate to the cell proliferation rate, λ, the carrying capacity, K, the radiotherapy response strength, γ, the decay rate of necrotic debris, ζ, the cell necrosis rate, η, and the initial proportion of the population that is necrotic, ϕ0.

Figure 4. Parameter clustering according to classified patient response. Kernel density estimate of the full posterior distribution following analysis of the training data set. Samples are classified into one of four patient responses according to the criteria set out in Section 2.2.1, and kernel density estimates of the bivariate marginal distributions conditioned on each classification are shown. To aid comparison in the vicinity of the mode of each conditional posterior, only regions with densities greater than 50% of the maximum are shown. Parameters relate to the cell proliferation rate, λ, the carrying capacity, K, the radiotherapy response strength, γ, the decay rate of necrotic debris, ζ, the cell necrosis rate, η, and the initial proportion of the population that is necrotic, ϕ0.

Figs. 5 and 6 for the four patients initially exhibited in Fig. 1.
Figure 7. Temporal predictions for the four patients excluded from the training set. We reproduce the analysis from Fig. 5 for the four patients in Fig. 1. These patients were not included in the training set, and so these results are representative of clinical predictions made throughout a new patient's course of treatment. Patients were classified previously as (a) a fast responder; (b) a poor responder; (c) exhibiting a plateaued response; and (d) exhibiting pseudo-progression. Results relating to the remaining seven patients excluded from the training set are given in the supplementary material (Fig. S6).

Figure 8. Predictions for the four patients excluded from the training set. We reproduce the analysis from Fig. 6 for the four patients in Fig. 1b-e. These patients were not included in the training set, and so these results are representative of clinical predictions made throughout a new patient's course of treatment.

Figure 10. Predictions for a synthetic patient with a fast response, subject to both GTV and necrosis measurements. (a-b) We reproduce the analysis from Fig. 5a in the case that information relating to both V(t) and N(t) is available. (c) Mean, 50%, and 95% credible intervals for the final GTV in both data collection scenarios. The true value (calculated by resimulating data from each synthetic patient without measurement noise) is also shown (red dashed). The lower plot in (c) is a cropped inset of the upper.

Figure 11. Predictions for two synthetic patients with a poor response, subject to both GTV and necrosis measurements. (a-b,d-e) We produce dynamic predictions of tumour progression for each patient in the case that information relating to both V(t) and N(t) is available. (c,f) Mean, 50%, and 95% credible intervals for the final GTV in both data collection scenarios. The true value (calculated by resimulating data from each synthetic patient without measurement noise) is also shown (red dashed). The lower plot in each set is a cropped inset of the upper.
Our framework is also well placed to incorporate radiotherapy plan adaptation and information relating to variations in the delivered dose throughout the course of treatment. Inclusion of such information is likely to lead to better response classification, particularly if the radiotherapy dose is modified during the course of treatment. Both the accuracy and precision of predictions could also be improved for all patients through a better biological understanding of radiotherapy response. The final set of results presented in this work highlights that GTV measurements alone are insufficient to identify the root cause of a poor response. Indeed, predictions related to the inner tumour composition must be treated with as much caution as predictions for atypical patients that are dissimilar to all patients in the training set. The absence of tumour composition data in the training set means that all predictions of tumour composition are informed by data only indirectly through the model, which has, in turn, been validated solely against GTV data. The prospect of training a model with joint GTV-composition measurements is at present hypothetical, although entirely possible with advanced imaging technologies [58][59][60]. At this stage, our framework could additionally be applied to answer important questions relating to the number of tumour composition measurements required to accurately predict patient outcome throughout the course of treatment.

Figure S3. Fits for patients in the training set. Individual fits for all patients in the training set, using the population-level prior. Shown are the data (black discs), 50% credible interval (light grey), and 95% credible interval (dark grey). Note that model predictions are only drawn at times corresponding to clinical measurements: results for intermediate time points are shown as a linear interpolation.

Figure S4. Subset of classified prior samples. We demonstrate the choices in the classification algorithm by simulating eight patients of each class from the prior distribution. All patients undergo the standard course of treatment. Shown are the tumour volume (black), a horizontal line indicating unity (dashed grey), and a vertical line showing the time of the first radiotherapy dose (dashed grey).

Figure S5. Results for eight additional synthetic patients. We reproduce the analysis from Fig. 5 in the main document for eight additional synthetic patients (two from each response classification).
Figure 2. Pseudo-hierarchical approach used in the analysis. Devoid of any data, knowledge about model parameters is encoded in the "first-level prior", denoted by p1(θ), and is used to individually form a set of posterior distributions for patients in the training set. The "second-level prior", denoted by p2(θ), represents knowledge gained from analysis of the training set and is used to form posterior distributions for new patients, denoted by pnew(θ|Dnew). In effect, the approach identifies patients in the training set with possibly similar outcomes to new patients.
In Section 2.1 we describe and present the clinical data set used for quantitative analysis, which demonstrates four disparate treatment response classifications. Secondly, in Section 2.2 we present a mechanistic mathematical model of tumour volume progression, along with a set of objective criteria that we use to classify model realisations into the four observed classifications. Subsequently, in Section 2.3 we present a statistical model that connects model predictions to clinical measurements. In Section 2.4

Table 2. Parameters and first-level prior distributions. The description relates to the exponentiated log parameter.

Alexander P Browning*‡1, Thomas D Lewin*1,2, Ruth E Baker1, Philip K Maini1, Eduardo G. Moros3, Jimmy Caudell3, Helen M Byrne†1, and Heiko Enderling†3,4,5
1 Mathematical Institute, University of Oxford, Oxford, UK
2 Roche Pharma Research and Early Development, Roche Innovation Center, Basel, Switzerland
3 Department of Radiation Oncology, H. Lee Moffitt Cancer Center & Research Institute, USA
4 Department of Integrated Mathematical Oncology, H. Lee Moffitt Cancer Center & Research Institute, USA
5 New address: Department of Radiation Oncology, MD Anderson Cancer Center, Houston, TX, USA

Table S1. Parameters resampled from the full posterior corresponding to four synthetic patients.
Parameter | Fast responder | Poor responder | Plateaued responder | Pseudo-progression
λ

Table S2. R̂ MCMC diagnostic statistics for each patient.