Graphical models
To help clarify the ideas discussed in this paper, it is convenient to start with a graphical representation of the structural assumptions relating the quantities in the PKPD model. Graphical models have become increasingly popular as ‘building blocks’ for constructing complex statistical models of biological and other phenomena [22]. These graphs consist of nodes representing the variables in the model, linked by directed or undirected ‘edges’ representing the dependence relationships between the variables. Here we focus on graphs where all the edges are directed and where there are no loops (i.e. it is not possible to follow a path of arrows and return to the starting node). Such graphs are known as Directed Acyclic Graphs (DAGs) and have been extensively used in modelling situations where the relationships between the variables are asymmetric, for example from cause to effect.
Figure 3a shows the generic situation we are interested in here. Unobserved variables are denoted by circular nodes while observed variables (i.e. the data) are denoted by square nodes. Arrows indicate directed dependencies between nodes. The model shown relates a response z to predictors x through parameters β, but where x is not directly observed. Instead we have observations y which depend on x through an assumed measurement error model. Note that, in order for the graph to represent a full joint probability distribution, we also assume that the unknown quantities at the top of the graph (i.e. β and x) are given appropriate prior probability distributions (usually chosen to be minimally informative); however, for clarity we suppress these dependencies in the graphical representation.
When considering the flow of information in a statistical model and the influence that one variable has on another, it is helpful to identify the conditional independence assumptions represented by the graph. In a directed acyclic graph, it is natural to draw analogies to family trees and refer to the ‘parents’ \(\hbox{pa}[v]\) of a node v as the set of nodes that have arrows pointing directly to v, ‘children’ \(\hbox{ch}[v]\) as nodes at the end of arrows emanating from v, and ‘co-parents’ as other parent nodes of a child. (The terms ‘descendants’ and ‘ancestors’, etc., then have obvious definitions.) A DAG expresses the assumption that any variable is conditionally independent of all its ‘non-descendants’, given its ‘parents’. If we wish to define a joint distribution over all variables, V, say, in a given graph, such independence properties are equivalent [23] to assuming
$$ p(V)=\prod_{v \in V} p(v|{\rm pa}[v]), $$
(1)
that is, the joint distribution is the product of conditional distributions for each node given its parents.
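To make the factorisation concrete, the following sketch evaluates p(V) as the product of each node's conditional density given its parents. The model is an invented toy example in the spirit of Fig. 3a (priors on β and x, measurement model y ∼ N(x, 1), response model z ∼ N(βx, 1)); none of these distributional choices come from the paper.

```python
import math

def normal_pdf(v, mean, sd):
    """Density of a Normal(mean, sd) distribution evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical DAG mirroring the structure of Fig. 3a: beta and x are
# parents of z, and x is the parent of y.  All numbers are illustrative.
dag = {
    "beta": {"parents": [], "density": lambda v, pa: normal_pdf(v, 0.0, 10.0)},
    "x":    {"parents": [], "density": lambda v, pa: normal_pdf(v, 0.0, 10.0)},
    "y":    {"parents": ["x"], "density": lambda v, pa: normal_pdf(v, pa["x"], 1.0)},
    "z":    {"parents": ["beta", "x"],
             "density": lambda v, pa: normal_pdf(v, pa["beta"] * pa["x"], 1.0)},
}

def joint_density(values):
    """Eq. 1: p(V) is the product over nodes v of p(v | pa[v])."""
    p = 1.0
    for node, spec in dag.items():
        pa = {name: values[name] for name in spec["parents"]}
        p *= spec["density"](values[node], pa)
    return p

values = {"beta": 1.0, "x": 2.0, "y": 1.8, "z": 2.1}
print(joint_density(values))
```

Any DAG can be encoded this way: the joint density never needs to be written down monolithically, only one conditional per node.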
The MCMC sampling-based algorithms used for Bayesian inference are able to exploit conditional independencies in the model in order to simplify computations. For example, the Gibbs sampler [11, 12] works by iteratively drawing samples for each unknown variable v from its full conditional distribution given (current) values for all the other variables in the model, which we denote \(p(v|V{\backslash}v)\). Now \(p(v|V\backslash v)\), which is proportional to p(V) viewed as a function of v, depends only on those terms in p(V) that contain v, which from Eq. 1 means that
$$ p(v|V\backslash v) \propto p(v|\hbox{pa}[v]) \prod_{w \in \hbox{ch}[v]} p(w|\hbox{pa}[w]); $$
hence the full conditional distribution is proportional to the product of a ‘prior’ term \(p(v|\hbox{pa}[v])\) and a likelihood term \(p(w|\hbox{pa}[w])\) for each child w of v. In particular it is clear that the sampling of each variable depends only on the current values for its parents, children and co-parents in the graph. This simple result forms the foundation for the efficient algorithms implemented in the WinBUGS software [5], and in turn provides an easy rule for adapting MCMC algorithms to sample from the adjusted posterior distributions we describe below for sequential PKPD analysis.
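This Markov-blanket property can be illustrated with a single-site Metropolis update (a common way of sampling a non-conjugate full conditional). The model below is an invented toy version of Fig. 3a, with β held fixed for simplicity; updating x requires only its prior term and one likelihood term per child.

```python
import math
import random

random.seed(1)

def log_normal_pdf(v, mean, sd):
    """Log-density of a Normal(mean, sd) distribution evaluated at v."""
    return -0.5 * ((v - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

# Illustrative model (structure of Fig. 3a): x ~ N(0, 10), y ~ N(x, 1),
# z ~ N(beta * x, 1).  The data values and beta are invented.
y_obs, z_obs, beta = 1.8, 2.1, 1.0

def log_full_conditional_x(x):
    """log p(x | rest): prior term plus one likelihood term per child of x."""
    prior = log_normal_pdf(x, 0.0, 10.0)           # p(x | pa[x])
    lik_y = log_normal_pdf(y_obs, x, 1.0)          # child y
    lik_z = log_normal_pdf(z_obs, beta * x, 1.0)   # child z (co-parent beta fixed)
    return prior + lik_y + lik_z

def metropolis_within_gibbs(x, n_iter=5000, step=0.5):
    """Random-walk Metropolis updates of x targeting its full conditional."""
    samples = []
    for _ in range(n_iter):
        prop = x + random.gauss(0.0, step)
        if math.log(random.random()) < log_full_conditional_x(prop) - log_full_conditional_x(x):
            x = prop
        samples.append(x)
    return samples

samples = metropolis_within_gibbs(0.0)
print(sum(samples) / len(samples))  # close to the analytic posterior mean, about 1.94
```

Note that nothing outside x's Markov blanket (here, the observed y and z and the fixed β) ever enters the acceptance ratio.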
Cutting the influence of children on their parents
Suppose we are interested in inference about β in the model represented in Fig. 3a. The correct posterior distribution \(p(\beta|y,z)\) can be written
$$ p(\beta|y,z) = \int p(\beta|x,z) p(x|y,z) \, dx $$
since β is independent of y given z (its child) and x (z’s co-parent). This means that the response data z are taken into account when making inferences about x, which subsequently influence the inferences about β. If the data z are more substantial than y, the estimation of x will tend to be dominated by the response model rather than the measurement error model. This may be particularly unattractive if we have greater confidence in the measurement error model, say through biological rationale for its functional form, while the response model is chosen more for convenience.
If we wanted to avoid the influence of z on the estimation of x, we would need to prevent or cut the feedback from z to x, allowing x to be estimated purely on the basis of y. This leads to the graphical model shown in Fig. 3b, which treats x as a parent of z but does not consider z to be a child of x. We denote this one-way flow of information by the ‘valve’ notation shown in the figure. When performing MCMC, all we have to do to prevent feedback is avoid including a likelihood term for z when sampling x. For example, if using Gibbs sampling, the conditional distribution used to generate values of x will not include any terms involving z.
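In sampler terms the cut amounts to one deleted likelihood term. The toy sketch below (the same invented model as above: x ∼ N(0, 10), y ∼ N(x, 1), z ∼ N(βx, 1) with β fixed; not the paper's PKPD model) contrasts the conditional for x with and without the contribution from z.

```python
import math
import random

random.seed(2)

def log_normal_pdf(v, mean, sd):
    """Log-density of a Normal(mean, sd) distribution evaluated at v."""
    return -0.5 * ((v - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

# Invented illustrative model: x ~ N(0, 10), y ~ N(x, 1), z ~ N(beta * x, 1).
y_obs, z_obs, beta = 1.8, 2.1, 1.0

def log_conditional_x(x, cut_feedback):
    """Conditional for x; under the cut, the likelihood term for z is omitted."""
    logp = log_normal_pdf(x, 0.0, 10.0) + log_normal_pdf(y_obs, x, 1.0)
    if not cut_feedback:
        logp += log_normal_pdf(z_obs, beta * x, 1.0)  # feedback from z, dropped by the cut
    return logp

def sample_x(cut_feedback, n_iter=5000, step=0.5):
    """Posterior mean of x from random-walk Metropolis on the chosen conditional."""
    x, total = 0.0, 0.0
    for _ in range(n_iter):
        prop = x + random.gauss(0.0, step)
        if math.log(random.random()) < (log_conditional_x(prop, cut_feedback)
                                        - log_conditional_x(x, cut_feedback)):
            x = prop
        total += x
    return total / n_iter

print(sample_x(cut_feedback=False))  # x informed by both y and z
print(sample_x(cut_feedback=True))   # x informed by y alone
```

With the cut in place, x reproduces the estimate from the measurement error model alone, however strongly z disagrees.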
Figure 4 shows the PD fit obtained when feedback is cut between the PD data (equivalent to z) and the PK parameters (equivalent to x) in the example discussed in sect. “Introduction”. There is perhaps some underprediction towards the upper end of the time scale but the fit is by no means inadequate, nor is it necessarily inferior to that from the full model (Fig. 2b). Indeed, since the PK fit for this ‘cut’ model is as for the original PK-only model shown in Fig. 1, one could argue that the pair of fits in Figs. 1 and 4 is preferable to the pair in Fig. 2.
We emphasise that our conclusions based on this approach no longer arise from a full probability model, and as such disobey basic rules for both likelihood and Bayesian inference. However, as discussed further in sect. “Discussion”, we may perhaps view this procedure as allowing a more ‘robust’ estimate of x that is not influenced by (possibly changing) assumptions about the form of the response model.
Population PKPD modelling
Following the notation in ZBS, we let \(y_i\) and \(z_i\) denote the vectors of observed concentrations and observed effects for individual \(i =1,\ldots,N\), with y and z denoting the collections of all PK and PD data across subjects. The (typically vector-valued) PK and PD parameters of the ith individual are denoted \(\theta_i\) and \(\phi_i\), respectively, with θ and ϕ denoting the sets of all PK and PD parameters. Finally, the PK and PD population parameters are denoted Θ and Φ respectively. Typically, the inter-individual distributions of the PK and PD parameters are assumed to be independently multivariate normal, so that \(\Uptheta=(\mu_{\theta}, \Upsigma_{\theta})\) and \(\Upphi=(\mu_{\phi}, \Upsigma_{\phi})\) are the population means and covariances of the individual-level PK and PD parameters respectively, although the following discussion is general and applies under any distributional assumptions.
ZBS consider a variety of models for estimating the PKPD parameters in the above set-up. Their simultaneous approach (SIM) corresponds to specifying a full probability model for the PKPD data, and the graph corresponding to such a model is shown in Fig. 5. Unlike the previous graphs, Fig. 5 represents a hierarchical model. The large rectangle (known as a ‘plate’) denotes a repetitive structure; that is, the nodes enclosed within the plate are repeated for each subject \(i=1,\ldots,N\) in the study. Nodes outside the plate are common to all subjects, and represent the population PK and PD parameters in this context. The arrow linking \(\theta_i\) to \(z_i\) in Fig. 5 represents the assumption that the PD responses, \(z_i\), depend on the true PK responses (drug concentrations), where the latter are modelled as a deterministic function of the PK parameters \(f_{PK_i}=f(\theta_i)\). Note that dependence of the true and observed concentrations on quantities fixed by the study design, e.g. the dose and measurement times, is suppressed for notational clarity, as are error terms and other nuisance parameters.
From Eq. 1, the joint distribution of all the data and parameters of the model represented by Fig. 5 can be written
$$ p(y,z,\theta,\phi,\Uptheta,\Upphi) = \left[ \prod_i p(z_i|\theta_i,\phi_i) p(y_i|\theta_i)p(\theta_i|\Uptheta)p(\phi_i|\Upphi) \right] p(\Uptheta) p(\Upphi). $$
The key distinction between the inferential approaches we consider lies with the estimation of the \(\theta_i\)’s. Note that the posterior distribution for θ could be written
$$ p(\theta | y, z) \propto p(z | \theta) p(\theta | y), $$
emphasising that the simultaneous posterior for θ conditional on both the PK and PD data is equivalent to having sequentially estimated the posterior for θ conditional only on the PK data, and then having used this posterior as the prior on θ for the second part of the analysis, conditioning on the PD data.
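This equivalence is immediate from Bayes’ theorem, assuming the PD data z are conditionally independent of the PK data y given θ (population parameters suppressed for clarity):

```latex
p(\theta \mid y, z)
  \;\propto\; p(z \mid \theta)\, p(y \mid \theta)\, p(\theta)
  \;\propto\; p(z \mid \theta)\, p(\theta \mid y),
```

since \(p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)\).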
As noted previously, ZBS consider three alternative sequential approaches. The first of these, which they term PPP (Population PK Parameters), estimates the \(\theta_i\)’s on the basis of an estimate of Θ and the PD data alone: hence the \(\theta_i\)’s are influenced by the PK data only through the estimation of Θ. Within our framework, this model may be represented by the graph shown in Fig. 6. The \(\theta_i^{\rm PK}\) nodes denote values of the PK parameters for subject i derived from the PK data alone, while the \(\theta_i\)’s are those parameters used to obtain the true concentrations \(f_{PK_i}=f(\theta_i)\) for the PD model: the ‘cut’ ensures there is no influence of the PD data on the estimation of Θ. One way to interpret the cut is to imagine that the PK data were analysed alone, giving rise to posterior distributions (given y) for both \(\theta^{\rm PK}\) and Θ. Each subject is then treated as a ‘new’ individual and a posterior-predictive distribution is derived for his/her PK parameters. These predictive distributions then represent the priors for the subject-specific PK parameters in an analysis of the PD data alone. Our proposed framework, however, allows all of this to be performed simultaneously rather than sequentially.
It is important to note that we do not know how to write down the resulting joint ‘posterior’ here: our approach allows us to sample from it, but we do not know its mathematical form (which is not to say that such a form does not exist). Hence we cannot compare it analytically with the correct posterior.
ZBS refer to the second of their sequential methods as PPP&D (Population PK Parameters and Data), which now estimates the individual PK parameters on the basis of both PK and PD data, but the PK population parameters using only the PK data. This method corresponds to the graphical model in Fig. 7. The node \(y^{\rm copy}\) denotes a duplicate copy of the PK data, and emphasises the point made by ZBS that in the PPP&D method, the PK data are used twice (see below for further discussion of the implications of this). Interpretation of the cut is as before, with the posterior-predictive distribution (given y alone) forming the prior for each subject-specific PK parameter vector used in the PD analysis, except that now the PK parameters used in the PD analysis are influenced both by z and by the duplicated PK data. Re-use of the PK data might, at first glance, suggest spurious precision. However, this is not the case: by cutting the feedback between the \(\theta_i\)’s and Θ, \(y^{\rm copy}\) is used to inform about the subject-specific PK parameters (in the PD analysis) without providing further spurious information about the population PK parameters.
The final method considered by ZBS is called IPP (Individual PK Parameters), in which the subject-specific PK parameters do not depend on the PD data at all. The graphical model corresponding to this method is shown in Fig. 8. Unlike the previous two sequential methods, the true concentrations needed for the PD model are derived from the original subject-specific PK parameters estimated from the PK model. In this case, the distribution of the individual \(\theta_i\)’s used in the PD analysis cannot be thought of as a prior, since it is fixed and hence not updated in the PD analysis. Rather, we think of specifying θ as a ‘distributional constant’ in the PD analysis, which has the flavour of multiple imputation. Note that in the case of the PPP and PPP&D sequential models, it is Θ that is specified as a distributional constant. Table 1 summarizes the key differences between the four models in terms of how they condition on the ‘data’ in order to estimate the population and subject-specific parameters. As the nodes from which cut valves emanate can be interpreted as ‘distributional constants’ (i.e. fixed distributions that are not updated), we consider the corresponding distributions as ‘data’ here.
Table 1 Four different population PKPD models and the data on which their parameters depend
Implementing cuts in the BUGS language
Figure 9 depicts how cut valves are implemented in the BUGS language. Suppose we have two stochastic nodes A and B and we want to allow B to depend on A but prevent feedback from B to A (hence in observing B, our belief about A is unchanged). We introduce a logical (deterministic) node C, which is essentially equal to A but which severs the link between A and C as far as any children of C are concerned. Node B then becomes a child of C as opposed to A, and so no feedback can flow from B to A, while C provides the same information as A when it acts as a parent of B. In the BUGS language we might write, for example:
$$ \begin{aligned} &{\tt A} \sim {\tt dnorm(mu, tau)}\\ &{\tt C} <{\!\!-} {\tt cut(A)}\\ &{\tt B} \sim {\tt dnorm(C, gamma)} \end{aligned} $$
where \({\tt cut(.)}\) is a logical function taking the same value as its argument (but which does not allow information to flow back towards its argument).