1 Introduction

Mathematical models are used in many areas of study to provide accurate quantitative predictions of biological phenomena. When models are used in safety-critical settings (such as drug safety or clinical decision-making), it is often important that our models produce accurate predictions over a range of scenarios, for example, for different drugs and patients. Perhaps more importantly, these models must allow a reliable quantification of confidence in their predictions. The field of uncertainty quantification (UQ) is dedicated to providing and communicating appropriate confidence in model predictions (Smith 2013). Exact models of biological phenomena are generally unavailable, and we resort to using approximate mathematical models instead. When our mathematical model does not fully recapitulate the data-generating process (DGP) of a real biological system, we call this model discrepancy or model misspecification. This discrepancy between the DGP and our models presents a particular challenge for UQ.

Often, models are trained using experimental data from a particular experimental design, and then used to make predictions under (perhaps drastically) different scenarios. We call the set of experimental designs under consideration the design space and denote it \({\mathcal {D}}\). We assume the existence of some DGP, which maps each element, \(d \in {\mathcal {D}}\), to some random output. These elements are known as experimental designs or, as is more common in electrophysiology, protocols, and each corresponds to some scenario for which our model can be used to make predictions. For example, in Sect. 1.1, each protocol, \(d\in {\mathcal {D}}\), is simply a set of observation times. By performing a set of experiments (each corresponding to a different protocol \(d\in {\mathcal {D}}\)) we can investigate (and quantify) the difference between the DGP and our models in different situations. When training our mathematical models using standard frequentist or Bayesian approaches, it is typically assumed that there is no model discrepancy; in other words, that the data arise from the model (for some unknown, true parameter set). This assumption is a necessary condition for certain desirable properties of parameter estimators, which provide guarantees regarding the accuracy of parameter estimates when there is a large number of observations, as discussed in Sect. 2.1.3. However, when model discrepancy is not considered, we can find that the ability of a model to make accurate predictions is compromised. In particular, if we try to validate our model with a protocol dissimilar to that used for training, there can be a noticeable difference between our predictions and the data—even when the model appears to fit the training data well. A simple illustration of this problem is introduced in the following section.

1.1 Motivating Example

In this section, we construct a simple example where we train a discrepant model with data generated from a DGP using multiple experimental designs. This example demonstrates that it is important to consider the protocol-dependence of parameter estimates and predictions when using discrepant models.

First, we construct a DGP formed of the sum of two exponential terms,

$$\begin{aligned} y^*\!(t)&= \exp \left\{ -t\right\} + \exp \left\{ -\frac{t}{10}\right\} , \end{aligned}$$
(1)
$$\begin{aligned} z^*\!(t)&= y^*\!(t) + \varepsilon (t), \end{aligned}$$
(2)

for \(t>0\), where each \(\varepsilon (t)\) is an independent Gaussian random variable with zero mean and variance \(\sigma ^2 = 10^{-4}\). Here, \( z^*(t)\) is a random variable representing an observation of the system at some time, \(t\).

Next, we attempt to fit a model which takes the form of single exponential decay,

$$\begin{aligned} y(t; {\varvec{\theta }})&= \theta _1 \exp \left\{ -\frac{t}{\theta _2}\right\} , \end{aligned}$$
(3)
$$\begin{aligned} z(t; {\varvec{\theta }})&= y(t; {\varvec{\theta }}) + \varepsilon (t), \end{aligned}$$
(4)

to these data, denoting the column matrix \([\theta _1, \theta _2]^T\) by \({\varvec{\theta }}\). We call this a discrepant model because there is no choice of \({\varvec{\theta }}\) such that \(y(t; {\varvec{\theta }}) = y^*\!(t)\), for all \(t > 0 \).

To train our model, we choose a set of \(n\) observation times, \(T = \{t_1,\, t_2,\, \ldots ,\, t_n\}\). We may then find the parameter set, \(\hat{\varvec{\theta }}(T)\), which minimises the sum-of-squares error between our discrepant model (Eq. 4) and each observation, \(z^*(t_i)\), that is,

$$\begin{aligned} \hat{\varvec{\theta }}(T) = \text {argmin}_{{\varvec{\theta }}\in \Theta }\left\{ \sum _{t_i\in T} \big (y(t_i; {\varvec{\theta }}) - z^*\!(t_i)\big )^2\right\} , \end{aligned}$$
(5)

where \(\Theta \) denotes the set of permissible parameter values and \(T\) is the set of observation times.

Next, we consider multiple experimental protocols which we may use to fit this model (Eq. 4). In particular, we consider the following sets of observation times:

$$\begin{aligned} T_1&= \left\{ 0,\, 0.01,\, 0.02,\, \ldots ,\, 0.1\right\} , \end{aligned}$$
(6)
$$\begin{aligned} T_2&= \left\{ 0,\, 0.1,\, 0.2,\, 0.3,\, \ldots ,\, 1\right\} , \end{aligned}$$
(7)
$$\begin{aligned} T_3&= \left\{ 0.2,\, 0.3,\, 0.4,\, 0.5,\, \ldots ,\, 1.2\right\} , \end{aligned}$$
(8)
$$\begin{aligned} T_4&= \left\{ 0.5,\, 0.55,\, 0.6,\, \ldots ,\, 1 \right\} , \;{\text {and}} \end{aligned}$$
(9)
$$\begin{aligned} T_\text {all}&= T_1 \cup T_2 \cup T_3 \cup T_4. \end{aligned}$$
(10)

We sample from the DGP \(10\) times by computing Eq. 1 at each observation time, \(t\), and adding IID Gaussian noise (Eq. 2). Then, for each sample of the DGP, we compute parameter estimates using each set of observation times (\(T_1, T_2, T_3, T_4\) and \(T_\text {all}\)); a sketch of this fitting procedure is given after the equations below. This process is then repeated with a ten-fold increase in sampling rate, that is, with observation times,

$$\begin{aligned} T_1'&= \left\{ 0,\, 0.001,\, 0.002,\, \ldots ,\, 0.1\right\} , \end{aligned}$$
(11)
$$\begin{aligned} T_2'&= \left\{ 0,\, 0.01,\, 0.02,\, 0.03,\, \ldots ,\, 1\right\} , \end{aligned}$$
(12)
$$\begin{aligned} T_3'&= \left\{ 0.2,\, 0.21,\, 0.22,\, 0.23,\, \ldots ,\, 1.2\right\} , \end{aligned}$$
(13)
$$\begin{aligned} T_4'&= \left\{ 0.5,\, 0.505,\, 0.51,\, \ldots ,\, 1 \right\} , \;{\text {and}} \end{aligned}$$
(14)
$$\begin{aligned} T_\text {all}'&= T_1' \cup T_2' \cup T_3' \cup T_4'. \end{aligned}$$
(15)
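
To make this concrete, a minimal sketch of the fitting step is given below. SciPy's least-squares routine and the random seed are illustrative assumptions, not necessarily the choices used to produce Fig. 1.

```python
# Minimal sketch of the motivating example (Eqs. 1-10); SciPy's least_squares
# is an illustrative choice, not necessarily the optimiser used here.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
sigma = 1e-2                                    # noise level: sigma^2 = 1e-4

def dgp(t):                                     # Eq. 1: sum of two exponentials
    return np.exp(-t) + np.exp(-t / 10)

def model(t, theta):                            # Eq. 3: single exponential decay
    return theta[0] * np.exp(-t / theta[1])

designs = {                                     # Eqs. 6-9 (n = 11 observations)
    'T1': np.linspace(0, 0.1, 11),
    'T2': np.linspace(0, 1, 11),
    'T3': np.linspace(0.2, 1.2, 11),
    'T4': np.linspace(0.5, 1, 11),
}
designs['T_all'] = np.concatenate(list(designs.values()))   # Eq. 10

for name, times in designs.items():
    z = dgp(times) + rng.normal(0, sigma, times.shape)           # Eq. 2
    fit = least_squares(lambda theta: model(times, theta) - z,   # Eq. 5
                        x0=[1.0, 1.0], bounds=(0, 10))
    print(name, fit.x)      # parameter estimates depend on the design used
```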

If we choose a Bayesian approach to the problem, we may specify a (relatively uninformative) uniform prior distribution on the model parameters, that is,

$$\begin{aligned} \theta _1&\sim U(0, 10), \end{aligned}$$
(16)
$$\begin{aligned} \text {and } \; \theta _2&\sim U(0, 10). \end{aligned}$$
(17)

The likelihood of our misspecified model is

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}; {\textbf{z}}^*) = \prod _{i=1}^n \sqrt{ \dfrac{1}{2\pi \sigma ^2} } \exp \left\{ -\dfrac{\left( z^*(t_i)- y(t_i; {\varvec{\theta }}) \right) ^2 }{2 \sigma ^2} \right\} , \end{aligned}$$
(18)

where \(n\) is the number of observations for this protocol, and \({\textbf{z}}^* = (z^*(t_i))_{i=1}^n\) is a vector of observations of the DGP. We may then explore the posterior distribution (Gelman et al. 2013) using Markov chain Monte Carlo. In particular, we use the Haario-Bardenet adaptive-covariance Metropolis-Hastings algorithm (Johnstone et al. 2016) as implemented in PINTS (Clerx et al. 2019b). This method was run using four parallel chains, each containing 25,000 iterations with a ‘burn-in’ period of 5,000 iterations.
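
A sketch of this sampling set-up is given below; the PINTS class names and call signatures reflect our reading of its documentation and should be treated as assumptions to be checked against the installed version.

```python
# Sketch of the Bayesian fit with PINTS (Clerx et al. 2019b); treat class names
# and signatures as assumptions to verify against the installed PINTS version.
import numpy as np
import pints

class SingleExponential(pints.ForwardModel):
    """Discrepant model (Eq. 3) in the PINTS ForwardModel interface."""
    def n_parameters(self):
        return 2
    def simulate(self, parameters, times):
        theta1, theta2 = parameters
        return theta1 * np.exp(-np.asarray(times) / theta2)

sigma = 1e-2                                         # known noise level
times = np.linspace(0, 1, 11)                        # e.g. design T2
rng = np.random.default_rng(1)
data = np.exp(-times) + np.exp(-times / 10) + rng.normal(0, sigma, times.shape)

problem = pints.SingleOutputProblem(SingleExponential(), times, data)
log_likelihood = pints.GaussianKnownSigmaLogLikelihood(problem, sigma)  # Eq. 18
log_prior = pints.UniformLogPrior([0, 0], [10, 10])                     # Eqs. 16-17
log_posterior = pints.LogPosterior(log_likelihood, log_prior)

x0 = [np.array([1.0, 1.0])] * 4                      # four parallel chains
mcmc = pints.MCMCController(log_posterior, 4, x0, method=pints.HaarioBardenetACMC)
mcmc.set_max_iterations(25000)
chains = mcmc.run()              # discard the first 5,000 iterations as burn-in
```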

Fig. 1

(Color figure online) Under model discrepancy, parameter estimates depend on the design used for training. Panels to the left of the dotted line correspond to designs containing \(n=11\) observations at times \(T_1, \ldots , T_4\), as shown in panel a. Panels on the right show designs with \(n=101\) observations, \(T_1', \ldots , T_4'\), as shown in panel b. a and b Representative datasets generated by the DGP (Eq. 1, solid black line), with points indicating observations (sampled using Eq. 2) and grey dashed lines showing the fitted discrepant model (Eq. 4, with calibrated \({\varvec{\theta }}\)). c and d The parameter estimates for each design, each fitted to one of ten repeats of the DGP. e and f 99% Bayesian credible regions obtained using MCMC, a uniform prior and a single repeat of the DGP. g and h Predictions using the discrepant model fitted using a single repeat of each protocol (using the estimates shown in e and f), showing the true DGP (black), discrepant model predictions (red), and the difference between predictions (grey)

As shown in Fig. 1, for each sample of the DGP, we obtain a parameter estimate from each set of observation times, each with a different distribution. For instance, training using \(T_1\) results in a model that approximates the DGP well on short timescales, whereas training using \(T_4\) allows us to recapitulate the behaviour of the system over longer timescales, as can be seen in panel a. From how closely the discrepant model (Eq. 4) fits the data in the regions where observations are made (Fig. 1, panels a and b), we can see that, in either case, a single exponential provides a reasonable approximation to the DGP. However, if we require an accurate model for both the slow and fast behaviour of the system, model discrepancy presents an issue, and this model (namely, Eq. 4) may be unsuitable; this is the case for \(T_2\), as shown in Fig. 1a. This variability in behaviour is shown in Fig. 1, panels g and h, which show how the model’s predictions for \(0 \le t \le 2\) depend on which protocol was used to fit the model.

The Bayesian posteriors illustrated in Fig. 1, panels e and f, show that we cannot avoid the problems caused by model discrepancy by simply adopting a Bayesian framework—we nevertheless obtain precise parameter estimates that are highly dependent on the chosen training protocol. This problem becomes more obvious when we increase the number of observations. In the examples detailed in this paper, we explore this ‘high-data limit’ where the variability in each parameter estimate (under repeated samples of the DGP) is minuscule compared to the difference between parameter estimates obtained from different protocols.

1.2 Ion Channel Modelling

Discrepancy has proven to be a difficult problem to address in modelling electrically-excitable cells (electrophysiology modelling, Lei et al. 2020b; Mirams et al. 2016). The same is true for many other mathematical models, such as rainfall-runoff models in hydrology (Beven 2006), models of the spread of infectious diseases in epidemiology (Guan et al. 2020; Creswell et al. 2023), and models used for the prediction of financial markets (Anderson et al. 2009).

The ‘rapid delayed rectifier potassium current’ (IKr), carried by the channel encoded primarily by hERG, plays an important role in the recovery of heart cells from electrical stimulation. It allows the cell membrane to return to its ‘resting potential’ ahead of the next heartbeat. This current can be blocked by pharmaceutical drugs, disrupting this process and causing dangerous changes to heart rhythm. Mathematical models are now routinely used in drug safety assessment to test whether the dynamics of IKr under drug block are expected to cause such problems (Li et al. 2017). However, these models provide only an incomplete description of IKr, and do not, for example, account for the stochastic behaviour of individual ion channels (Mirams et al. 2016). For this reason, an understanding of model discrepancy and its implications is crucial in building accurate dynamical models of IKr which permit a realistic appraisal of their predictive uncertainty.

In Sect. 1.1, we presented a simple example, in which each protocol corresponds to a particular choice of observation times. However, there may be other aspects of the design to decide upon. For example, in electrophysiology, whole-cell patch-clamp experiments are performed by placing an electrode in the solution inside the cell membrane (the intracellular solution), and another in the solution outside the cell (the extracellular solution). Voltage-clamp experiments are a particular type of patch-clamp experiment in which a voltage signal is applied across the cell membrane, whilst the current flowing across the cell membrane is recorded. Here, the protocol consists of the chosen voltage at each time (treated as a forcing function in ODE-based models), together with a set of observation times for the resulting current (the observed output).

Electrophysiologists have a lot of control, and therefore choice, regarding the protocol design; but little work has been done to explore how the choice of protocol used to gather training data affects the accuracy of subsequent predictions. We explore these protocol-dependent effects of model discrepancy in Sect. 3.

1.3 Previously Proposed Methods to Handle Discrepancy

One way of reducing over-confidence in inaccurate parameter estimates in the presence of model discrepancy may be to use approximate Bayesian computation (ABC) (Frazier et al. 2020). With ABC, a likelihood function is not explicitly specified; instead, the model is repeatedly simulated for proposed values of the parameter sampled from a prior distribution. Each proposed value is accepted or rejected according to whether the simulated trajectory is “close” to the actual data, according to some chosen summary statistics. ABC compares the simulated with the real data using these summary statistics (rather than matching all aspects of the dynamics) and accepts approximate matches (subject to a chosen tolerance). It is suited to inference where there is substantial model discrepancy because this approach can decrease potential over-confidence in the inferred values of parameters. However, it is challenging to select suitable summary statistics, and the computational demands of ABC are much greater than those of the methods we propose.

Another approach was introduced by Kennedy and O’Hagan (2001), who added a Gaussian process discrepancy term to the observables. This work has since been applied to electrophysiology models (Lei et al. 2020b). Elsewhere, Sung et al. introduced an approach to account for heteroscedastic errors using many repeats of the same experiment (Sung et al. 2020), although this seems to be less applicable to the hERG modelling problem (introduced in Sect. 2.2) because the number of repeats of each experiment (when training individual, cell-specific models) is limited. Alternatively, Lei and Mirams (2021) modelled the discrepancy using a neural network within the differential equations. However, these approaches reduce the interpretability of otherwise simple mechanistic models, and, when compared with models that simply ignore model discrepancy, could potentially result in worse predictions under protocols that are dissimilar to those used for training.

Instead, we use a diverse range of experiments to train our models and build a picture of how model discrepancy manifests under different training protocols. We are then able to judge the suitability of our models, and provide empirically-derived, spread-of-prediction intervals which provide a realistic level of predictive uncertainty due to model discrepancy. We demonstrate the utility of these methods under synthetically generated data by constructing two examples of model discrepancy.

2 Methods

We begin with a general overview of our proposed methods before providing two real-world examples of their applications. In Sect. 2.1.1, we outline some notation for a statistical model consisting of a dynamical system, an observation function, and some form of observational noise. This allows us to talk, in general terms, about model calibration and validation in Sect. 2.1.2. In particular, we describe a method for validating our models, in which we change the protocol used to train the model. This motivates our proposed methods for combining parameter estimates obtained from different protocols to empirically quantify model discrepancy for the prediction of unseen protocols.

2.1 Fitting Models Using Multiple Experimental Protocols

2.1.1 Partially Observable ODE Models

In this paper, we restrict attention to deterministic models of biological phenomena, in which a system of ordinary differential equations (ODEs) is used to describe the deterministic time-evolution of some finite number of states, although the method would generalise straightforwardly to other types of model. This behaviour may be dependent on the protocol, \(d\), chosen for the experiment, and so, we express our ODE system as,

$$\begin{aligned} \dfrac{\text {d}{\textbf{x}}}{\text {d}t}&= {\textbf{f}}({\textbf{x}}, t; {\varvec{\theta }}_f, d), \end{aligned}$$
(19)

where \({\textbf{x}}\) is a column vector of length \(N\) describing the ‘state’ of the system, \(t\) is time, and the parameters specifying the dynamics of the system are denoted \({\varvec{\theta }}_f\). Additionally, the system is subject to some initial conditions which may be dependent on \({\varvec{\theta }}_f\). Owing to \({\textbf{x}}\)’s dependence on the protocol and model parameters, we use the notation,

$$\begin{aligned} {\textbf{x}}(t; {\varvec{\theta }}_f, d), \end{aligned}$$
(20)

to denote the solution of Eq. 19 under protocol \(d\) and a specific choice of parameters, \({\varvec{\theta }}_f\).

This ODE system is related to our noise-free observables via some observation function of the form,

$$\begin{aligned} h\left( {\textbf{x}}, t; {\varvec{\theta }}_{h}, d\right) , \end{aligned}$$
(21)

where \({\textbf{x}}\) is the state of the ODE system (Eq. 19), \(t\) is the time at which the observation occurs, \(d\) is the protocol, and \({\varvec{\theta }}_h\) is a vector of additional parameters, distinct from those in \({\varvec{\theta }}_f\). We make observations of the system, via this function, at a set of observation times, \(\{t_i\}_{i=1}^{n_d}\), defined by the protocol, \(d\).

For concision, we may stack \({\varvec{\theta }}_f\) and \({\varvec{\theta }}_{h}\) into a single vector of model parameters,

$$\begin{aligned} {\varvec{\theta }}= \begin{bmatrix} {\varvec{\theta }}_f \\ {\varvec{\theta }}_h \end{bmatrix}. \end{aligned}$$
(22)

Then, we denote an observation at time \(t_i\) by

$$\begin{aligned} y_i({\varvec{\theta }}; d) = h\big ({\textbf{x}}(t_i; {\varvec{\theta }}_f, d), t_i; {\varvec{\theta }}_{h}, d\big ). \end{aligned}$$
(23)

We denote the set of possible model parameters by \(\Theta \), such that \({\varvec{\theta }}\in \Theta \). We call this collection of possible parameter sets the parameter space.

For each protocol, \(d\in {\mathcal {D}}\), and vector of model parameters, \({\varvec{\theta }}\), we may combine each of our observations into a vector,

$$\begin{aligned} {\textbf{y}}({\varvec{\theta }}; d) = \begin{bmatrix} y_1({\varvec{\theta }}; d), \\ \vdots \\ y_{n_d}({\varvec{\theta }}; d) \end{bmatrix}. \end{aligned}$$
(24)

Additionally, we assume some form of random observational error such that, for each protocol, \(d\), each observation is a random variable,

$$\begin{aligned} z_i(d) = y_i({\varvec{\theta }}; d) + \varepsilon _i, \end{aligned}$$
(25)

where each \(\varepsilon _i\) is the error in the \(i^\text {th}\) observation. Here each protocol, \(d\), is performed exactly once so that we obtain one sample of each vector of observations (the vector \({{\textbf {z}}}(d) = [z_1(d), \dots , z_{n_d}(d)]\)). In the examples presented in Sects. 3.1 and 3.2, we assume that our observations are subject to independent and identically distributed (IID) Gaussian errors, with zero mean and standard deviation, \(\sigma \).
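
The sketch below illustrates this general framework: solve the ODE system under a given protocol, apply the observation function at the protocol’s observation times, and add IID Gaussian noise. The function arguments are hypothetical placeholders and SciPy’s solver is an illustrative choice.

```python
# Illustrative sketch of Eqs. 19-25: simulate the ODE system under a protocol,
# observe it through h at the protocol's observation times, and add IID noise.
# The arguments f, h and d are hypothetical placeholders.
import numpy as np
from scipy.integrate import solve_ivp

def observe(f, h, x0, theta_f, theta_h, d, times, sigma, rng):
    sol = solve_ivp(lambda t, x: f(x, t, theta_f, d),          # Eq. 19
                    (times[0], times[-1]), x0,
                    t_eval=times, method='LSODA', rtol=1e-8, atol=1e-8)
    y = np.array([h(x, t, theta_h, d)                          # Eqs. 21 and 23
                  for x, t in zip(sol.y.T, times)])
    return y + rng.normal(0.0, sigma, y.shape)                  # Eq. 25
```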

2.1.2 Evaluation of Predictive Accuracy and Model Training

Given some parameter set \({\varvec{\theta }}\), we may evaluate the accuracy of the resultant predictions under the application of some protocol \(d\in {\mathcal {D}}\) by computing the root-mean-square error (RMSE) (Willmott et al. 1985) between these predictions, and our observations (\({\textbf{z}}(d)\)),

$$\begin{aligned} \text {RMSE}\big ({\textbf{y}}({\varvec{\theta }}; d),\;{\textbf{z}}(d)\big ) = \sqrt{\dfrac{1}{n_{d}}\sum _{i=1}^{n_{d}} \big (y_i({\varvec{\theta }}; d) - z_i(d)\big )^2}, \end{aligned}$$
(26)

where \(n_d\) is the number of observations in protocol \(d\). We choose the RMSE as it permits comparison between protocols with different numbers of observations.

Similarly, we may train our models to data, \({\textbf{z}}(d)\), obtained using some protocol, \(d\), by finding the parameter set that minimises this quantity (Eq. 26). In this way, we define the parameter estimate obtained from protocol \(d\) as,

$$\begin{aligned} \hat{\varvec{\theta }}_d = \text {argmin}_{{\varvec{\theta }}\in \Theta }\left\{ \text {RMSE}\big ({\textbf{y}}({\varvec{\theta }}, d),\; {\textbf{z}}(d)\big )\right\} , \end{aligned}$$
(27)

which is a random variable (because it depends on our random data, \({\textbf{z}}\)). Since minimising the RMSE is equivalent to minimising the sum-of-squares error, this estimate is also the least-squares estimator (identical to Eq. 5). Moreover, under the assumption of Gaussian IID errors, Eq. 27 is exactly the maximum likelihood estimator because the natural logarithm of the likelihood can be written as,

$$\begin{aligned} \log \left\{ {\mathcal {L}}({\varvec{\theta }}; {\textbf{z}})\right\} = -\dfrac{n}{2} \log \left( 2\pi \sigma ^2\right) - \dfrac{1}{2\sigma ^2} \sum _{i=1}^n \big (y_i({\varvec{\theta }}; d) - z_i(d)\big )^2, \end{aligned}$$
(28)

where \(\sigma \) may be either known or treated as an additional unknown. Equation 28 can be maximised over \({\varvec{\theta }}\) and \(\sigma \) by first finding the parameter set, \(\hat{\varvec{\theta }}\), which minimises the sum-of-squares term (a step that does not depend on \(\sigma \)), and then finding the optimal estimate, \(\hat{\sigma }\). Whilst these estimates of \({\varvec{\theta }}\) are identical whether or not \(\sigma \) is known, only examples with known (and not estimated) \(\sigma \) are explored in this paper.

Having obtained such a parameter estimate, we may validate our model by computing predictions for some other protocol, \(\tilde{d}\in {\mathcal {D}}\). To do this, we compute \( {\textbf{y}}(\hat{\varvec{\theta }}_d; {\tilde{d}})\). This is a simulation of the behaviour of the system (without noise) under protocol \({\tilde{d}}\), made using parameter estimates that were obtained by training the model to protocol \(d\) (as in Eq. 27). In this way, our parameter estimates, each obtained from a different protocol, result in different out-of-sample predictions (predictions of the results for protocols other than the one used for training). Because we aim to train a model able to produce accurate predictions for all \(d\in {\mathcal {D}}\), it is important to validate our model using multiple protocols.

By computing \(\text {RMSE}\big ({\textbf{y}}(\hat{\varvec{\theta }}_d; {\tilde{d}}), {\textbf{z}}({\tilde{d}})\big )\) for each pair of training and validation protocols, \(d\) and \({\tilde{d}}\), we are able to perform model validation across multiple training and validation protocols. This allows us to ensure our models are robust with regard to the training protocol, and allows for the quantification of model discrepancy, as demonstrated in Sect. 3.
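
This cross-validation can be tabulated over all pairs of training and validation protocols (cf. the heatmaps in Figs. 5 and 7); a sketch follows, in which simulate(theta, d) is a hypothetical stand-in for computing \({\textbf{y}}({\varvec{\theta }}; d)\).

```python
# Sketch of cross-validation over training/validation protocol pairs.
# `simulate(theta, d)` is a hypothetical stand-in for computing y(theta; d),
# and `estimates`/`data` are dictionaries keyed by protocol.
import numpy as np

def rmse(y, z):                                        # Eq. 26
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(z)) ** 2))

def cross_validate(estimates, data, simulate):
    protocols = list(estimates)
    table = np.empty((len(protocols), len(protocols)))
    for i, d_train in enumerate(protocols):            # row: training protocol
        for j, d_valid in enumerate(protocols):        # column: validation protocol
            prediction = simulate(estimates[d_train], d_valid)
            table[i, j] = rmse(prediction, data[d_valid])
    return protocols, table
```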

2.1.3 Consequences of Model Error/Discrepancy

Ideally, we would have a model that is correctly specified, in the sense that the data arise from the model being fitted. In other words, our observations \({\textbf{z}}(d)\) arise from Eq. 25 where \({\varvec{\theta }}\) is some fixed, unknown value, \({\varvec{\theta }}^*\in \Theta \). Then we may consider the distance between an estimate \(\hat{\varvec{\theta }}\) and this true value. When the model is correctly specified, and given suitable regularity conditions on the model and the design, \(d\), we can obtain arbitrarily accurate parameter estimates by increasing the number of observations, \(n\). More precisely, \(\hat{\varvec{\theta }}\) converges in probability to \({\varvec{\theta }}^*\) as \(n \rightarrow \infty \). This property is known as consistency (Seber and Wild 2005). These regularity conditions include, for example, that the model is structurally identifiable for the particular \(d\in {\mathcal {D}}\) used for training; that is, that different values of the parameter, \({\varvec{\theta }}\), result in different model output (Wieland et al. 2021). Other conditions ensure that \({\textbf{y}}({\varvec{\theta }}; d)\) is suitably smooth as a function of \({\varvec{\theta }}\) (Seber and Wild 2005).

However, when training discrepant models, we may find that our parameter estimates are heavily dependent on the training protocol, as demonstrated in Sect. 1.1. For unseen protocols, these discordant parameter sets may lead to a range of vastly different predictions, even if each parameter set provides a reasonable fit for its respective training protocol. In such a case, further data collection may reduce the variance of these parameter estimates, but fail to significantly improve the predictive accuracy of our models.

In Sect. 3, we explore two examples of synthetically constructed model discrepancy. In Sect. 3.1, we have that \({\textbf{f}}\) and \(h\) (Eqs. 19 and 21) are exactly those functions used to generate the data, and the exact probability distribution of the observational errors is known. However, one parameter is fixed to an incorrect value. In other words, the true parameter set \({\varvec{\theta }}^*\) lies outside the parameter space used in training the model. Under the assumption of structural identifiability (and a compact parameter space), this is an example of model discrepancy because there is some limit to how well our model can recapitulate the DGP.

In Sect. 3.2 we explore another example of model discrepancy where our choice of \({\textbf{f}}\) (and, in this case, the dimensions of \({\varvec{\theta }}\) and \({\textbf{x}}\)) are misspecified by training a model which differs structurally from the one used in the DGP.

2.1.4 Ensemble Training and Prediction Interval

As outlined in Sect. 2.1.2, we can obtain a parameter estimate from each protocol, \(d\), in some set of training protocols, \({\mathcal {D}}_\text {train}\subseteq {\mathcal {D}}\), by solving Eq. 27. We then obtain an ensemble of parameter estimates,

$$\begin{aligned} \left\{ \hat{\varvec{\theta }}_d \;: \; d \in {\mathcal {D}}_\text {train}\right\} . \end{aligned}$$
(29)

Then, for any validation protocol \({\tilde{d}}\), the set,

$$\begin{aligned} \left\{ {\textbf{y}} (\hat{\varvec{\theta }}_d; {\tilde{d}}) \;: \; d \in {\mathcal {D}}_{\text {train}}\right\} , \end{aligned}$$
(30)

is an ensemble of predictions. Each of these estimates may be used individually to make predictions. We may then use these ensembles of parameter estimates to attempt to quantify uncertainty in our predictions. We do this by considering the range of our predictions for each observation of interest. For the \(i\text {th}\) observation of our validation protocol, \({\tilde{d}}\), that is,

$$\begin{aligned} {\mathcal {B}}^{(i)}&= \left[ {\mathcal {B}}_\text {lower}^{(i)},{\mathcal {B}}_\text {upper}^{(i)}\right] \nonumber \\&= \left[ \min _{d\in \mathcal {D_\text {train}}}\left\{ y_i(\hat{\varvec{\theta }}_d; {\tilde{d}}) \right\} , \max _{d\in \mathcal {D_\text {train}}}\left\{ y_i(\hat{\varvec{\theta }}_d; {\tilde{d}}) \right\} \right] . \end{aligned}$$
(31)

When all observations are considered at once, Eq. 31 comprises a band of predictions, giving some indication of uncertainty in the predictions. We demonstrate below that this band provides a useful indication of predictive error for unseen protocols, and provides a range of plausible predictions. We propose that a wide band of predictions for a given validation protocol suggests that there is model discrepancy and poor prediction accuracy for a particular context of use.

This interval (Eq. 31) cannot shrink as more training protocols are added. If a large number of protocols are considered, percentiles of our ensemble of predictions may provide additional insight. However, in this paper, we only consider cases where there are a small number of protocols (five training protocols are used in each of the examples discussed in Sect. 3).

For the purposes of a point estimate, we use the midpoint of each interval,

$$\begin{aligned} {\mathcal {B}}^{(i)}_\text {mid} = \dfrac{{\mathcal {B}}^{(i)}_\text {lower} + {\mathcal {B}}^{(i)}_\text {upper}}{2}. \end{aligned}$$
(32)

This is used to assess the predictive error of the ensemble in Fig. 8. There are other ways to gauge the central tendency of the set of predictions (Eq. 30). Such a change would have little effect on Sect. 3, but a median or weighted mean may be as (or more) suitable for other problems.
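
A sketch of these two quantities, computed from an ensemble of predictions stacked row-wise (one row per training protocol), is given below.

```python
# Sketch of the spread-of-predictions interval (Eq. 31) and its midpoint
# (Eq. 32). `ensemble` has one row per training protocol and one column per
# observation in the validation protocol.
import numpy as np

def prediction_interval(ensemble):
    ensemble = np.asarray(ensemble)
    lower = ensemble.min(axis=0)            # B_lower in Eq. 31
    upper = ensemble.max(axis=0)            # B_upper in Eq. 31
    midpoint = 0.5 * (lower + upper)        # Eq. 32
    return lower, upper, midpoint

# e.g. ensemble = np.vstack([simulate(theta_d, d_valid)
#                            for theta_d in estimates.values()]),
# reusing the hypothetical `simulate` and `estimates` from the sketch in Sect. 2.1.2.
```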

2.2 Application to an Ion Current Model

We now turn our attention to an applied problem in which dynamical systems are used to model cellular electrophysiology. We apply our methods to two special cases of model discrepancy using synthetically generated data.

Firstly, we introduce a common paradigm for modelling macroscopic currents in electrically excitable cells, so-called Markov models (Rudy and Silva 2006; Fink and Noble 2009). In this setting, the term ‘Markov model’ is often used to refer to systems of ODEs where the state variables describe the proportional occupancy of some small collection of ‘states’, and the model parameters affect transition rates between these states. These models are discussed in Sect. 2.2.1 and may be seen as a special case of the more general ODE model introduced in Sect. 2.1. Additionally, in Sect. 2.2.2, we briefly introduce some relevant electrophysiology and in Sect. 2.2.3, we provide a detailed overview of our computational methods.

2.2.1 Markov Models of IKr

Here, we use Markov models to describe the dynamics of IKr, especially in response to changes in the transmembrane potential. For any Markov model (as described above), the derivative function can be expressed in terms of a matrix, \({\textbf{A}}\), which is dependent only on the transmembrane potential, \(V\). Accordingly, where \(x_i\) denotes the proportion of channels in some state, \(i\), Eq. 19 becomes,

$$\begin{aligned} \dfrac{\text {d}{\textbf{x}}}{\text {d}t}&= {\textbf{f}}({\textbf{x}}, t; {\varvec{\theta }}_f, d), \nonumber \\&= {\textbf{A}}\left( V(t; d); {\varvec{\theta }}_f\right) {\textbf{x}}, \end{aligned}$$
(33)

where \(V(t; d)\) is the specified transmembrane potential at time \(t\) under protocol \(d\). The off-diagonal elements of \({\textbf{A}}(V; {\varvec{\theta }}_f)\), that is, \({\textbf{A}}_{i, j}(V; {\varvec{\theta }}_f)\) for \(i \ne j\), describe the transition rate from state \(j\) to state \(i\) at transmembrane potential, \(V\). Usually, the transition rates (elements of \({\textbf{A}}\)) are either constant or of the form \(\theta _i e^{\pm \theta _j V(t; d)}\) with \(\theta _i, \theta _j > 0\). Hence, each transition rate, \(k\), is either \(0\) for all \(V \in {\mathbb {R}}\) or satisfies \(k > 0\) for all \(V \in {\mathbb {R}}\).

Before and after each protocol, cells are left to equilibrate with the voltage, \(V\), set to the holding potential, \(V_\text {hold}=-80\) mV. Therefore, we require the initial conditions, at time \(t=0\),

$$\begin{aligned} {\textbf{x}}(0) = {\textbf{x}}_\infty (V_\text {hold}), \end{aligned}$$
(34)

where, \({\textbf{x}}_\infty (V_\text {hold})\) is the unique steady-state solution for the linear system,

$$\begin{aligned} \dfrac{\text {d}{\textbf{x}}}{\text {d}t} = {\textbf{A}}(V_\text {hold}; {\varvec{\theta }}_f) {\textbf{x}}, \end{aligned}$$
(35)

subject to the constraint \(\sum _{i=1}^N x_i(0) = 1\). Note also that \({\textbf{A}}(V_\text {hold}; {\varvec{\theta }}_f)\) may be singular, as is the case when the number of channels is conserved (\(\sum _{i=1}^Nx_i(t) = 1\) for all \(t\)). This is the case for both Markov models used in this paper. To find \({\textbf{x}}_\infty \), we may follow Fink and Noble’s method (see Supplementary Material of Fink and Noble 2009). A more technical discussion of the steady states of such ODE systems is found in Keizer (1972).
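
One simple way to compute \({\textbf{x}}_\infty (V_\text {hold})\) in this conserved setting is to replace one row of the (singular) rate matrix with the conservation constraint and solve the resulting linear system, in the spirit of Fink and Noble’s method; a sketch follows, using a purely illustrative three-state model with hypothetical rates.

```python
# Sketch of computing x_inf(V_hold) (Eqs. 34-35) for a conserved Markov model,
# dx/dt = A(V) x with sum(x) = 1. The three-state rate matrix below is purely
# illustrative (hypothetical rates), not one of the models used in this paper.
import numpy as np

def rate(a, b, V):                 # transition rates of the form a * exp(+/- b * V)
    return a * np.exp(b * V)

def A(V):                          # columns sum to zero: channel numbers are conserved
    k1, k2 = rate(2e-3, 0.05, V), rate(4e-2, -0.03, V)   # C -> O and O -> C
    k3, k4 = rate(1e-1, 0.02, V), rate(5e-3, -0.04, V)   # O -> I and I -> O
    return np.array([[-k1,        k2,   0.0],
                     [ k1, -(k2 + k3),  k4 ],
                     [ 0.0,       k3,  -k4 ]])

def steady_state(A_hold):
    # Replace one row of the singular matrix with the conservation constraint
    # sum(x) = 1 and solve the linear system (cf. Fink and Noble 2009).
    n = A_hold.shape[0]
    M = A_hold.copy()
    M[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(M, b)

x_inf = steady_state(A(-80.0))     # V_hold = -80 mV (Eq. 34)
```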

As is standard for models of IKr (Beattie 2015), we take our observation function to be

$$\begin{aligned} I_\text {Kr} = h({\textbf{x}}, t; {\varvec{\theta }}_{h}, d) = g \cdot [O](t; {\varvec{\theta }}_{f}, d) \cdot (V(t; d) - E_\text {Kr}), \end{aligned}$$
(36)

where \([O]\) denotes the proportion of channels in an ‘open’ conformation (one of the components of \({\textbf{x}}\)), \(g\) is the sole parameter in \({\varvec{\theta }}_h\), known as the maximal conductance, and \(E_\text {Kr}\) is the Nernst potential, which is found by calculating

$$\begin{aligned} E_\text {Kr} = \dfrac{RT}{F}\ln \left\{ \dfrac{[K_\text {out}]}{[K_\text {in}]}\right\} , \end{aligned}$$
(37)

where \(R\) is the gas constant, \(F\) is Faraday’s constant, \(T\) is the absolute temperature, and \([K_\text {in}]\) and \([K_\text {out}]\) are the intracellular and extracellular concentrations of \(K^+\), respectively. Here, we choose the temperature to be room temperature (\(T=298\,\text {K}\)), and our intracellular and extracellular concentrations to be 120 mM and 5 mM, respectively, which approximately correspond to physiological concentrations (Hille 2001). Hence, for all synthetic data generation, training, and validation in Sects. 3.1 and 3.2, we fix \(E_\text {Kr} = -80.24\) mV (using Eq. 37).

From Eqs. 33–36, we can see that both the dynamics of the model and the observation function are dependent on the voltage, \(V(t; d)\). This is a special case of Eqs. 23–25, in which \({\textbf{f}}\) and \(h\) are time-dependent only via the voltage, \(V(t; d)\). In other words, at any fixed voltage (\(V\)), Eq. 33 is an autonomous system and \(h\) does not depend directly on \(t\).

We assume that our observational errors are additive Gaussian IID random variables with zero mean and variance, \(\sigma ^2\).

The first model of IKr we consider is by Beattie et al. (2018). This is a four-state Markov model with nine parameters (eight of which relate to the model’s kinetics and form \({\varvec{\theta }}_f\)). We use the parameters that Beattie et al. obtained from one particular cell (cell #5) by training their model to data obtained from an application of the ‘sinusoidal protocol’ with a manual patch-clamp setup. The cells used were Chinese hamster ovary cells, which were made to heterologously over-express hERG1a. These experiments were performed at room temperature.

The second model is by Wang et al. (1997). This is a five-state model which has \(15\) parameters. These parameters were obtained by performing multiple voltage-clamp protocols, all at room temperature, on multiple frog oocytes overexpressing hERG. These experiments are used to infer activation and deactivation time constants as well as steady-state current voltage relations, which are, in turn, used to produce estimates for the model parameters. Of the two parameter sets provided in Wang et al. (1997), we use the parameter set obtained by using the extracellular solution with a 2 mM concentration of potassium chloride, as this most closely replicates physiological conditions.

The systems of ODEs for the Beattie and Wang models, as well as the parameterisations of the transition rates, are presented in Appendix 5.1. The values of the model parameters, as used in Sects. 3.1 and 3.2 are given in Table 1.

2.2.2 Experimental Designs for Voltage Clamp Experiments

A large amount of data can be recorded in voltage-clamp electrophysiology experiments: the current can be recorded at a several-kHz sampling rate, for many minutes. In what follows, we take observations of the current at the same 10 kHz frequency for all protocols. Experimenters have a great deal of flexibility when it comes to specifying voltage-clamp protocol designs. We have published a number of studies on the benefits of ‘information-rich’ experimental designs for these protocols, focusing on short protocols which explore a wide range of voltages and timescales (Beattie et al. 2018; Lei et al. 2019a, b; Clerx et al. 2019a; Kemp et al. 2021). In a real patch-clamp experiment, the amount of data we can obtain from each cell is limited. Hence, it is not feasible to perform many long protocols in sequence on the same cell. For this reason, we use six short information-rich protocols, denoted \(d_0\) to \(d_5\), as shown in Fig. 3.

Here, we use simple designs consisting of a combination of sections where the voltage is held constant or ‘ramps’ where the voltage increases linearly with time for compatibility with automated high-throughput patch clamp machines which are restricted to protocols of this type. For the protocols included in this paper, short identical sequences including ramps are placed at the beginning and end of each protocol. In real experiments, these elements will allow for quality control, leak subtraction, and the estimation of the reversal potential (Lei et al. 2019a, b). The central portion, consisting of steps during which the voltage is held constant, is what varies between protocols.

Not all possible designs are suitable for training models. Sometimes we encounter protocols for which distant pairs of parameter sets yield approximately equal model output—i.e. the model output for a protocol is not sensitive to certain (possibly large) changes in the model parameters. Subsequently, when training the model to data generated from this protocol, many different parameter sets give similar fits that are almost equally plausible. This problem is loosely termed numerical unidentifiability (Fink and Noble 2009) and generally speaking is best avoided, unless the resulting uncertainty in the model parameters is known to be immaterial regarding any possible future context of use.

For both the Beattie and Wang models, numerical unidentifiability is a problem for design \(d_0\) (data not shown, but this phenomenon is illustrated for a similar IKr model and protocol in Fig. 3 of Whittaker et al. (2020)). Yet \(d_0\) mimics the transmembrane voltage of a heart cell in a range of scenarios, and so provides a good way to validate whether our models recapitulate well-studied, physiologically-relevant behaviour. In particular, the central portion of this voltage-protocol consists of a sequence of wave-forms, each of which resembles the action potential of muscle cells found in the heart (ten Tusscher et al. 2004). So in this study we use \(d_0\) as a validation protocol, but do not use it as a protocol for training models.

The remaining designs, \(d_1\)–\(d_5\), were constructed using various criteria under constraints on voltage ranges and the duration of each step. The design \(d_1\) was constructed algorithmically by sampling from a probability distribution placed over possible parameter sets and maximising the difference in model outputs between all pairs of parameter sets sampled from this distribution; \(d_5\) was the result of the same algorithm applied to the Wang model. In contrast, \(d_4\) is a manual design we have used previously (Lei et al. 2019a) based on a simplification of a sinusoidal design (Beattie et al. 2018). The design \(d_2\) is a further manual refinement of \(d_4\) to explore inactivation processes (rapid loss of current at high voltages) more thoroughly. Finally, \(d_3\) is based on maximising the exploration of the model phase-space for the Beattie model, visiting as many combinations of binned model states and voltages as possible. The main thing to note for this study, however, is that \(d_1\)–\(d_5\) result in good numerical identifiability (Fink and Noble 2009) for both models—that is, when used in synthetic data studies that attempt to re-infer the underlying parameters, all five protocol designs yield very low-variance parameter estimates (as shown in Fig. 5 for \(\lambda =1\) for the Beattie model, and Fig. 7 for the Wang model). This is a useful property, because it allows us to disregard the (very small) effect of different random noise in the synthetic data on the spread of our predictions (Eq. 31).

Fig. 2

(Color figure online) The structural differences between the two Markov model structures used in this paper for synthetic data generation and model training. a The four-state Beattie model used in both Case I and Case II. b The five-state Wang model used only for Case II. When a channel is in the open/conducting (O) state (green), current is able to flow, whereas when the model is in one of the other, closed (C) or inactivated (I), states, no current can flow. The arrows adjacent to each model structure indicate the direction in which rates increase as the voltage increases

2.2.3 Computational Methods

Numerical solution of ODEs

Any time we calculate \({\textbf{y}}({\varvec{\theta }}; d)\), we must solve a system of ordinary differential equations. We use a version of the LSODA solver designed to work with the Numba package in Python, which allows efficient, just-in-time compiled code to be generated from symbolic expressions. We partitioned each protocol into sections where the voltage is constant or changing linearly with respect to time, because this sped up our computations. We set LSODA’s absolute and relative solver tolerances to \(10^{-8}\). The fact that the total number of channels is conserved in our models allows us to reduce the number of ODEs we need to solve from \(N\) to \(N - 1\) (Fink and Noble 2009).
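
The section-by-section strategy can be sketched as follows, using SciPy’s LSODA interface as a stand-in for the Numba-compiled solver; the representation of a protocol as a list of constant-voltage or ramp sections is an illustrative assumption.

```python
# Sketch of solving the model section-by-section, with SciPy's LSODA as a
# stand-in for the Numba-compiled solver used here; the protocol representation
# (a list of constant-voltage or ramp sections) is an illustrative assumption.
import numpy as np
from scipy.integrate import solve_ivp

def solve_protocol(A, x0, sections, n_per_section=100):
    """A(V): rate matrix; sections: list of (t_start, t_end, V_start, V_end)."""
    times, states = [], []
    x = np.asarray(x0, dtype=float)
    for (t0, t1, V0, V1) in sections:
        def voltage(t):                    # linear ramp (constant if V0 == V1)
            return V0 + (V1 - V0) * (t - t0) / (t1 - t0)
        sol = solve_ivp(lambda t, y: A(voltage(t)) @ y, (t0, t1), x,
                        t_eval=np.linspace(t0, t1, n_per_section),
                        method='LSODA', rtol=1e-8, atol=1e-8)
        x = sol.y[:, -1]                   # initial condition for the next section
        times.append(sol.t)
        states.append(sol.y)
    return np.concatenate(times), np.hstack(states)

# e.g. times, states = solve_protocol(A, x_inf, [(0, 250, -80, -80), (250, 300, -80, 40)]),
# reusing A and x_inf from the sketch in Sect. 2.2.1; the observed current then
# follows from Eq. 36: I_Kr = g * states[open_index] * (V - E_Kr).
```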

Synthetic data generation

Having computed the state of the system at each observation time, \(({\textbf{x}}(t_i, {\varvec{\theta }}^*, d))_{i=1}^{n_d}\), it is simple to compute each \(y_i\) by substituting \({\textbf{x}}\) into our observation function (Eq. 36). Finally, to add noise, we obtain \(n_d\) independent samples using Eq. 25, using NumPy’s (Harris et al. 2020) interface to the PCG-64 pseudo-random number generator. Here, because we are using equally spaced observations with a 10 kHz sampling rate, \(n_d = 10^4 \times t_\text {dur}\), where \(t_\text {dur}\) is the duration of the protocol in seconds.
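
For example, the noise can be generated as follows; the seed and the protocol duration are illustrative values.

```python
# Adding observational noise (Eq. 25) with NumPy's default generator, which is
# PCG-64. The seed and protocol duration are illustrative.
import numpy as np

rng = np.random.default_rng(seed=1234)      # PCG-64 by default
t_dur = 15.0                                # protocol duration in seconds
n_d = int(1e4 * t_dur)                      # 10 kHz sampling rate
noise = rng.normal(0.0, 0.03, size=n_d)     # sigma = 0.03 nA (Sects. 3.1 and 3.2)
# z = y + noise
```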

Optimisation

Finding the least-squares estimates, or, equivalently, minimising Eq. 26 is (in general) a nonlinear optimisation problem for which there exist many numerical methods. We use CMA-ES (Hansen 2006) as implemented by the PINTS interface (Clerx et al. 2019b). CMA-ES is a global, stochastic optimiser that has been applied successfully to many similar problems.
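
A sketch of this optimisation set-up is given below, reusing the toy single-exponential model from Sect. 1.1 as a stand-in for an ion current model; the PINTS class names and signatures are assumptions to be checked against the installed version.

```python
# Sketch of least-squares fitting with CMA-ES via PINTS (Clerx et al. 2019b);
# the toy single-exponential model stands in for an ion current model, and the
# class names/signatures are assumptions to verify against the installed PINTS.
import numpy as np
import pints

class SingleExponential(pints.ForwardModel):
    def n_parameters(self):
        return 2
    def simulate(self, parameters, times):
        return parameters[0] * np.exp(-np.asarray(times) / parameters[1])

times = np.linspace(0, 1, 101)
data = np.exp(-times) + np.exp(-times / 10)       # noise-free data, for brevity

problem = pints.SingleOutputProblem(SingleExponential(), times, data)
error = pints.RootMeanSquaredError(problem)       # Eq. 26
controller = pints.OptimisationController(error, np.array([1.0, 1.0]),
                                           method=pints.CMAES)
controller.set_max_unchanged_iterations(200)
theta_hat, best_error = controller.run()          # Eq. 27 (one of several repeats)
```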

Fig. 3

Left: a range of different input voltage-clamp protocols (forcing functions) used in this study. Right: corresponding synthetic output data IKr simulated using the Beattie model with noise added as described in Sect. 2.2.3. Here, we generate and plot data observed at a 10 kHz sampling rate. Training protocols (all protocols except \(d_0\)) were tested for numerical identifiability (Fink and Noble 2009): inverse problems performed on synthetic data with repeatedly sampled random noise yielded parameter estimates with little variability

We follow the optimisation advice described in Clerx et al. (2019a). That is, for parameters ‘a’ and ‘b’ in state transition rates of the form \(k=a \exp {(bV)}\), the optimiser works with ‘\(\log a\)’ and untransformed ‘b’ parameters. We enforce fairly lenient constraints on our parameter space, \(\Theta \), to prevent a proposed parameter set from forcing transitions to become so fast/slow that the ODE system becomes very stiff and computationally difficult to solve. In particular, taking a similar approach to Clerx et al. (2019a), we require that every parameter is positive, and, for ease of computation,

$$\begin{aligned} 1.67\times 10^{-5} \,\text {ms}^{-1} \le k_\text {max} \le 10^{3}\,\text {ms}^{-1} \end{aligned}$$
(38)

where \(k_\text {max}\) is the maximum transition rate, \(k(V)\), for all

$$\begin{aligned} V \in [-120\,\text {mV}, +60\,\text {mV}], \end{aligned}$$

which is the voltage range used in our protocols (Fig. 3).

Because CMA-ES is a stochastic algorithm, repeated runs can produce different output. To ensure that we have found the global minimum (Eq. 27), we repeat every optimisation numerous times (25 repeats for \(\lambda =1\) in Case I, \(5\) repeats for subsequent \(\lambda \), and \(25\) repeats in Case II).

Moreover, in Sect. 3.2, when training the discrepant model, our initial guesses for the kinetic parameters were randomly sampled using

$$\begin{aligned} \log _{10}(p) \sim U(-7, -1), \end{aligned}$$
(39)

whereas we set the maximal conductance initial guess (which only affects the observation function) to the value used for data generation (even though these data were generated using a different model structure). We then check that our parameter set satisfies Eq. 38, and resample if necessary before commencing the optimisation routine.
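
A sketch of this sampling-and-rejection step follows; the pairing of sampled values into (a, b) rates is illustrative, and the helper names are hypothetical.

```python
# Sketch of sampling initial guesses for the kinetic parameters (Eq. 39) and
# rejecting any that violate the rate bounds (Eq. 38). The pairing of sampled
# values into (a, b) rates is illustrative; in the real models the sign of b is
# fixed by the model structure.
import numpy as np

RATE_MIN, RATE_MAX = 1.67e-5, 1e3            # ms^-1 (Eq. 38)
V_RANGE = (-120.0, 60.0)                     # mV

def rates_admissible(pairs):
    """pairs: iterable of (a, b) giving rates k(V) = a * exp(b * V)."""
    for a, b in pairs:
        # exp is monotone in V, so the maximum rate lies at an endpoint
        k_max = max(a * np.exp(b * V) for V in V_RANGE)
        if not (RATE_MIN <= k_max <= RATE_MAX):
            return False
    return True

def sample_initial_guess(n_params, rng):
    while True:
        p = 10.0 ** rng.uniform(-7, -1, size=n_params)     # Eq. 39
        if rates_admissible(zip(p[0::2], p[1::2])):        # illustrative pairing
            return p

theta0 = sample_initial_guess(8, np.random.default_rng(1))
```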

The examples presented in Sects. 3.1 and 3.2 require the solution of many optimisation problems. For speed, these tasks may be organised in such a way that multiple optimisation problems can be solved in parallel.

3 Results

In this section, we use synthetically generated data to explore two cases of model discrepancy in Markov models of IKr. In the first case, we gradually introduce discrepancy into a model with the correct structure by fixing one of its parameters to values away from the DGP parameter set. Then, in Sect. 3.2, we apply the same methods to another case where the model structure is incorrectly specified. In both cases, we take a literature model of IKr together with Gaussian IID noise to be the DGP.

3.1 Case I: Misspecified Maximal Conductance

In this case, we assume a correctly specified model structure, but assume increasingly erroneous values of one particular parameter, and investigate how this impacts the protocol-dependence of our parameter estimates and the predictive accuracy of our models. Also, we explore how the spread in our model predictions (Eq. 31) increases as the amount of discrepancy increases (in a particular manner).

To do this, we simulate data generation from each training protocol, as outlined above, ten times, using Gaussian IID noise with standard deviation 0.03 nA. Specifically, we take the true DGP to be the Beattie model, whose structure is shown in Fig. 2. Then, we fix the maximal conductance (\(g\)) to a range of values, and infer the remaining model parameters from the synthetic data, generated using the true parameter set, \({\varvec{\theta }}^*\). We assume that the standard deviation of the Gaussian noise is known because it can be well estimated from the initial portion of each protocol, where the current is stationary.

When training our models, we use a restriction of the usual parameter space to fit the data by assuming some fixed value, \(\lambda \), for the maximal conductance, \(g\). In this way, we reformulate the optimisation problem slightly such that Eq. 27 becomes

$$\begin{aligned} \hat{\varvec{\theta }}_{\lambda }(d) = \text {argmin}_{{\varvec{\theta }}\in \Theta _\lambda } \left\{ \text {RMSE}\left( {\textbf{y}}({\varvec{\theta }}; d), {\textbf{z}}(d)\right) \right\} , \end{aligned}$$
(40)

where \(\Theta _\lambda \) is the subset of parameter space where the maximal conductance is fixed to \(\lambda g\). For each repeat of each protocol, we solve this optimisation problem for each scaling factor, \(\lambda \in \left\{ \frac{1}{4}, \frac{1}{2}, 1, 2, 4 \right\} \). These calculations are identical to those used in the computation of profile likelihoods under the assumption of additive IID Gaussian errors (Bates 1988).

Next, we fit these restricted parameter-space models to the same dataset and assess their predictive power under the application of a validation protocol. We do this for each possible pair of training and validation protocols.

To reduce the time required for computation we fit our discrepant models sequentially, starting at \(\lambda =1\) and increasing or decreasing \(\lambda \), using previous parameter estimates as an initial guess. This is done so that, for example, the kinetic parameters found by fixing \(\lambda = 2\) are used as our initial guess when we fit the model with \(\lambda = 4\), unless the original kinetic parameters (Table 1) provide a lesser RMSE than the results of the previous optimisation.

The spread in predictions for the validation protocol, \(d_0\), for \(\lambda \in \{\frac{1}{4}, 1, 4\}\), is shown in Fig. 4. A more complete summary of these results is provided by Fig. 5. Here, when \(\lambda = 1\) (the central row of Fig. 5), we can see that no matter what protocol is used to train the model, the parameter estimates (panel a) are centred around their true values, and the resultant predictions are all accurate (Fig. 5, panels b and c). In contrast, when the maximal conductance, \(g\), is set to an incorrect value, our parameter estimates become biased, and, overall, our predictions become much less accurate. This effect on predictive accuracy is also shown in Fig. 10.

Moreover, the inaccuracy in our parameter estimates and our predictions varies depending on the design used to train the model. This effect does not appear to be symmetrical, with \(\lambda < 1\) seemingly resulting in more model discrepancy than \(\lambda > 1\).

Fig. 4

(Color figure online) The set of predictions (Eq. 30) shown for parameter estimates obtained by training with different values of \(\lambda \) to synthetic data under \(d_1, \ldots , d_5\) (using the Beattie model). The synthetically generated data used for model validation are shown in grey and the spread of the predictions is highlighted in yellow. a The voltage trace for \(d_0\). b The set of predictions with \(\lambda = \frac{1}{4}\). c The set of predictions with \(\lambda = 1\), that is, under the assumption of the correct maximal conductance (\(g\)). d The set of predictions with \(\lambda = 4\). N.B. the ‘angular’ nature of the current is not a plotting artefact, but reflects the fact that the voltage clamp (a) is constructed from a series of linear ramps for compatibility with automated voltage clamp machines

Fig. 5

(Color figure online) Discrepancy in parameter estimates and subsequent currents when a non-discrepant model is fitted to synthetic data, with all parameters free except the maximal conductance, \(g\), which is scaled by some factor \(\lambda \), (\(g=\lambda g^*\), where \(g^*\) is the true value). a Estimates of \(\theta _1\) and \(\theta _2\) obtained by training with different protocols for 10 repeats of the DGP. The lines (linearly interpolated using 17 values for \(\lambda \in [ \frac{1}{4}, 4] \)) show how the estimates from each protocol improve as \(\lambda \rightarrow 1\). b \(d_0\) voltage protocol. c The spread of predictions of \(I_\text {Kr}\) under the \(d_0\) protocol using the parameter estimates in Column a. d A heatmap showing the predictive error obtained by training and validating for each pair of protocols. Here c corresponds with the top row of each heatmap, as indicated

Fig. 6

(Color figure online) Case II: the set of predictions (Eq. 30) shown for parameter estimates obtained by training the Beattie and Wang models with data synthetically generated using the Wang model. a The \(d_0\) voltage-clamp protocol. b The set of predictions using the Beattie model. c The set of predictions using the Wang model, that is, under the assumption of the correct model structure

Fig. 7

(Color figure online) Model discrepancy between the Beattie model and synthetic data generated using the Wang model. a Estimates of the maximal conductance obtained by training with different protocols for ten repeats of the DGP. There is a noticeable protocol dependence for estimates obtained using the (discrepant) Beattie model, but the true underlying parameter (dashed line) can be accurately determined from any protocol when using the (correct) Wang model. b \(d_0\) voltage protocol. c The spread of predictions for \(I_\text {Kr}\) under the \(d_0\) protocol for discrepant (Beattie) and correct (Wang) models, which are shown in more detail in Fig. 6. d Cross-validation heatmaps for both the Beattie and Wang models fitted to this suite of protocols, averaged over ten repeated samples of the DGP for each protocol

Fig. 8

(Color figure online) The spread in predictions obtained from different protocols provides a useful indicator of model discrepancy for Case II. a The validation voltage-clamp protocol, \(d_0\), with colours corresponding to panels d and e. b The spread-of-predictions interval (Eq. 31) for \(d_0\) using the Beattie model trained with \(d_1, \ldots , d_5\). c The true DGP subtracted from the spread-of-predictions interval. d The true DGP with the colour of each observation corresponding to panels a and e. e A scatter plot of the midpoint prediction (Eq. 32) and the width of the predictive interval (Eq. 31) for every observation in \(d_0\). Here, the red dashed lines indicate where the true value of \(I_\text {Kr}\) lies exactly on the extremes of the range of predictions. Accordingly, points above these lines correspond to observations for which the DGP lies inside this range, and points below the lines correspond to observations for which the true DGP lies outside this range. The colours of these points correspond to those in panels a and d

Further results are provided in Sect. 5.2. Figure 10 shows the error in our predictions of \(d_0\) as \(\lambda \) varies; Table 2 examines the distribution of our parameter estimates for each protocol (under repeated samples of the DGP) for different values of \(\lambda \); and Table 3 shows the behaviour of our spread-of-predictions interval (Eq. 31) and midpoint prediction (Eq. 32) for different values of \(\lambda \).

3.2 Case II: Misspecified Dynamics

Next, we apply these methods to an example where we have misspecified the dynamics of the model (the function \({\textbf{f}}\)). We use two competing Markov models of hERG kinetics, the Beattie model (Beattie et al. 2018), and the Wang model (Wang et al. 1997). These models have differing structures and differing numbers of states, as shown in Fig. 2. We generate a synthetic dataset under the assumption of the Wang model with Gaussian IID noise (with standard deviation 0.03 nA) and the original parameter set as given in Wang et al. (1997), for all the protocols shown in Fig. 3. As in Case I, we assume the standard deviation of this noise is known.

Then, we are able to fit both models to this dataset, obtaining an ensemble of parameter estimates and performing cross-validation as described in Sect. 2. In this way, we can assess the impact of choosing the wrong governing equations (the choice of \({\textbf{f}}\) in Eq. 19) on the predictive accuracy of the model. We do this to investigate whether the techniques introduced in Sect. 2.1 are able to provide some useful quantification of model discrepancy when the dynamics of IKr are misspecified.

Our results, presented in Figs. 6 and 7, show how we expect a correctly specified model to behave in comparison to a discrepant model. We can see from the bottom row of Fig. 7 that, when training using the correctly specified derivative matrix, we were able to accurately recover the true maximal conductance using each and every protocol. Moreover, similarly to Case I, no matter which protocol the correctly specified model was trained with, the resultant predictions were very accurate (as can be seen in Fig. 6).

However, when the discrepant model was used, there was significant protocol dependence in our parameter estimates, and our predictions were much less accurate overall, though perhaps accurate enough for many applications. Moreover, it seems that for the majority of \(d_0\), the spread in predictions across training protocols (Eq. 31) was smaller than that seen in Case I, but there are certain portions where the discrepant model and DGP are noticeably different (as highlighted in Fig. 6). This may be due to the structural differences between the Wang and Beattie models. In particular, in the Wang model, channels transitioning from the high-voltage inactive state (I) must pass through the conducting, open state (O) in order to reach the low-voltage closed states (C1, C2, C3), causing a spike of current to be observed. Instead, channels in the Beattie model may transition through the inactive-and-closed state (IC) on their way between O and C, resulting in reduced current during steps from high voltage to low voltage.

Nevertheless, our methods provide a useful indication of this model discrepancy. Figure 8 examines the behaviour of our prediction interval (Eq. 31) in more detail. Importantly, we can see that our interval shows little uncertainty during sections of the protocol where there is little current (this is also seen in Fig. 7a and Fig. 6). This is ideal, because no reasonable model would predict a sizeable current here. On the other hand, we see that our intervals show significant uncertainty around the spikes in current that occur at the start of each action-potential waveform. This is to be expected because it is known that these sections of the current time-series are particularly sensitive to differences in the ‘rapid inactivation’ process in these models (Clerx et al. 2019a).

Further results regarding Case II are provided in Sect. 5.3. Tables 4 and 5 summarise the behaviour of our parameter estimates for each choice of model.

4 Discussion

We have introduced an uncertainty quantification (UQ) approach to highlight when discrepancy is affecting model predictions. We demonstrated the use of this technique by providing insight into the effects of model discrepancy on a model of IKr in electrically excitable cells. Here, we saw that under synthetically constructed examples of model discrepancy, there was great variability between the parameter estimates obtained using different experimental designs. This variability is a consequence of the different compromises that a discrepant model has to make to fit different regimes of a true DGP’s behaviour. Consequently, these parameter estimates produced a wide range of behaviour during validation, despite each individual parameter estimate having little variability under repeated samples of the DGP.

The variability in the model predictions stemming from this ensemble of parameter estimates is, therefore, an empirical way of characterising the predictive uncertainty due to model discrepancy. Usefully, our spread-of-prediction intervals (Eq. 31) correctly indicated little uncertainty when the ion channel model was exhibiting simple dynamics decaying towards a steady state, but more uncertainty during more complex dynamics, which was indeed when the largest discrepancies occurred. For many observations under our validation protocol, the true, underlying DGP lay inside this interval, indicating that Eq. 31 may provide a useful indication of predictive error under unseen protocols. We expect that the presented methods may be of use for problems where the variability in parameter estimates (from repetitions of each individual protocol) is smaller than the variability between parameter estimates obtained from different protocols; this is the case, for example, when there is little noise and there are many observations. In such cases, the variability in the extremes of our ensembles (\({\mathcal {B}}_\text {upper}\) and \({\mathcal {B}}_\text {lower}\)) is immaterial compared to the width of the interval (\({\mathcal {B}}_\text {upper} - {\mathcal {B}}_\text {lower}\)).
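
As a concrete illustration of how these quantities can be computed, the short Python sketch below forms pointwise extremes, a midpoint and a width from an ensemble of predictions on a common time grid. The stacked arrays are placeholders; the precise definitions of Eqs. 31 and 32 should be taken from the methods described above rather than from this sketch.

    import numpy as np

    t = np.linspace(0, 5, 500)
    # Placeholder ensemble: one prediction of the validation protocol per
    # training protocol, stacked row-wise on a common time grid.
    ensemble = np.vstack([np.exp(-t), 0.8 * np.exp(-1.2 * t), 0.9 * np.exp(-0.9 * t)])

    b_lower = ensemble.min(axis=0)           # pointwise lower extreme of the ensemble
    b_upper = ensemble.max(axis=0)           # pointwise upper extreme of the ensemble
    midpoint = 0.5 * (b_lower + b_upper)     # midpoint prediction
    width = b_upper - b_lower                # wide where the estimates disagree most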

Fig. 9 (Color figure online) The Beattie and Wang models may be seen as special cases of this more complicated model. The state labelled ‘C3’ is called ‘C’ in the Beattie model and ‘C3’ in the Wang model. The arrows outside the Markov state diagram indicate the direction in which rates increase with increasing voltage

At first, Case I may seem like an artificial example: in practice, the maximal conductance is taken to be a model parameter and fitted along with the rest of the model. But Case I and Case II are similar: any two Markov models may be regarded as special cases of a more general model with some transition rates pinned to \(0\) (as shown in Fig. 9 for the models used in this paper). As in Case I, this means that different model structures can be seen as restrictions of this larger model’s parameter space. Misspecified model structures can then be identified with subsets of parameter space which do not contain the true, data-generating parameter set (provided this larger model is structurally identifiable).

This means there is a setting in which Case II (misspecified governing equations) is an example of the type of discrepancy explored in Case I, where a “true” parameter value exists in the more general model but is excluded from the parameter space being optimised over when training the model. This may prove a valuable perspective for modelling ion channel kinetics, where there are many candidate models (Mangold et al. 2021), and each model may be seen as corresponding to some subset of a general model with a shared higher-dimensional parameter space. Model selection problems have been framed in this way previously (Akaike 1998; Chen et al. 2017).
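
The short sketch below illustrates this "pinned rates" view schematically: a general model whose transitions are the union of the two candidate structures (identifying the Beattie state C with C3), with each candidate recovered by fixing the rates on its missing transitions to zero. The edge lists are based only on the structural description given above; they are schematic and are not a full parameterisation of the model in Fig. 9.

    # Schematic edge lists (state pairs) for the two candidate structures,
    # identifying the Beattie state C with C3; these follow the description in
    # the text and are not a complete specification of either model.
    wang_edges    = {("C1", "C2"), ("C2", "C3"), ("C3", "O"), ("O", "I")}
    beattie_edges = {("C3", "O"), ("O", "I"), ("I", "IC"), ("IC", "C3")}
    general_edges = wang_edges | beattie_edges      # the more general model

    def restrict(rates, allowed_edges):
        """Pin the rates on transitions absent from a submodel to zero."""
        return {edge: (r if edge in allowed_edges else 0.0)
                for edge, r in rates.items()}

    # A (nominally 'true') nonzero rate on ("C1", "C2") in the general model is
    # pinned to zero by the Beattie restriction, so the truth lies outside that
    # submodel's region of parameter space.
    rates = {edge: 1.0 for edge in general_edges}
    print(restrict(rates, beattie_edges)[("C1", "C2")])   # -> 0.0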

4.1 Limitations

Whilst the spread of predictions under some unseen protocol may provide some feasible range of predictions, we can see from Fig. 8 that our observables (the DGP without noise) often lie outside this range; this is also shown in Fig. 6. Here, certain structural differences between the model and the DGP may mean that the truth lies outside the interval. For this reason, Eq. 31 is best interpreted as a heuristic indication of predictive uncertainty stemming from model discrepancy, rather than providing any guarantees about the output of the DGP.
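
One simple way to quantify this limitation is an empirical coverage check: the proportion of validation time points at which the noise-free DGP lies inside the band. A minimal sketch follows, with placeholder arrays standing in for the DGP and for \({\mathcal {B}}_\text {lower}\) and \({\mathcal {B}}_\text {upper}\); the stand-ins are arbitrary and only the calculation itself is of interest.

    import numpy as np

    t = np.linspace(0, 5, 500)
    truth = np.exp(-t) / (1.0 + 0.3 * t)          # stand-in for the noise-free DGP
    b_lower = 0.95 * truth - 0.005                # stand-ins for the band extremes
    b_upper = 1.02 * truth + 0.005

    inside = (truth >= b_lower) & (truth <= b_upper)
    print(f"proportion of time points covered: {inside.mean():.2f}")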

Using more training protocols may increase the coverage of the DGP by our interval. Whilst the number of protocols that can be performed on a single biological cell is limited by time constraints (Beattie et al. 2018; Lei et al. 2019b), using more protocols is likely to be preferable where practical.

Besides the types of discrepancies considered in Sect. 3, there are other ways that the DGP can differ from the fitted models. For example, the DGP may not be accurately described by an ODE system, especially when ion channel numbers are small and the stochasticity of individual channels opening and closing is apparent. In this circumstance, the models can be cast in terms of stochastic differential equations (SDEs), as in Goldwyn et al. (2011), and we can again consider an ensemble of parameter estimates (Eq. 29) and an ensemble of model predictions (Eq. 30). The assumption of IID Gaussian errors for the observation noise model could also be inaccurate: auto-correlated noise processes (e.g. as explored in Creswell et al. 2020; Lambert et al. 2022) or even experimental artefacts may be present. All of these features could be included in the modelled DGP (Lei et al. 2020a), and it remains to be seen how well our method would perform in these cases.
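
To illustrate the SDE formulation mentioned above, the sketch below integrates a Langevin-type equation for the proportion of open channels in a single two-state (closed/open) gate using the Euler-Maruyama scheme. The rates, channel count and noise term are purely illustrative placeholders and are not the hERG models considered in this paper.

    import numpy as np

    rng = np.random.default_rng(0)

    alpha, beta = 5.0, 2.0          # opening and closing rates (placeholder values)
    n_channels = 200                # smaller numbers give visibly noisier traces
    dt, n_steps = 1e-3, 5000

    x = np.empty(n_steps)           # proportion of open channels
    x[0] = alpha / (alpha + beta)   # start at the deterministic steady state
    for i in range(1, n_steps):
        drift = alpha * (1.0 - x[i - 1]) - beta * x[i - 1]
        # Langevin (system-size) approximation to the channel noise
        diffusion = np.sqrt(max(alpha * (1.0 - x[i - 1]) + beta * x[i - 1], 0.0)
                            / n_channels)
        x[i] = x[i - 1] + drift * dt + diffusion * np.sqrt(dt) * rng.normal()
        x[i] = min(max(x[i], 0.0), 1.0)   # keep the proportion in [0, 1]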

4.2 Future Directions

We were able to quantify model discrepancy by considering the predictive error of our models across a range of training and validation protocols. This provides a way of quantifying model discrepancy that can be compared across models, and could be used to select the most suitable model from a group of candidates. For a given context of use, we suggest that the spread of predictions can be used to gauge the trustworthiness of a model’s predictions. Even for a model which produces a plausible fit to each individual training protocol, a wide spread of predictions may indicate that the model is ill-suited to a particular predictive task, and should prompt careful reconsideration of the model and the experiments used for its training. In this way, the disagreement between predictions of \(d_0\) in Sect. 3.1 shows that the \(\lambda = \frac{1}{4}\) model may not be suitable, owing to the relative width of this band of predictions, even in the absence of validation data or knowledge of a more suitable model.
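
One way such a comparison could be operationalised, sketched below, is to summarise each candidate model's band by its mean width relative to the scale of the predicted signal and to flag candidates whose bands are comparatively wide. The bands here are placeholder arrays, and any threshold on the relative width would be a modelling choice rather than something prescribed by this paper.

    import numpy as np

    t = np.linspace(0, 5, 500)
    # Placeholder (lower, upper) bands for two candidate models on a common grid
    bands = {
        "candidate_a": (0.95 * np.exp(-t), 1.05 * np.exp(-t)),
        "candidate_b": (0.70 * np.exp(-t), 1.30 * np.exp(-t)),
    }

    signal_scale = np.exp(-t).max()
    for name, (lower, upper) in bands.items():
        relative_width = np.mean(upper - lower) / signal_scale
        print(name, f"mean band width / signal scale = {relative_width:.2f}")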

Our approach may provide insight into improved experimental design criteria (Lei et al. 2022). Optimal experimental design approaches that assume knowledge of a correctly specified model may not be the most suitable in the presence of known discrepancy. By adjusting these approaches to account for uncertainty in the choice of model, we may be able to use these ideas to design experiments which allow for more robust model validation. One method would be to fit to data from a collection of training protocols, and then to find a new protocol for which the spread in ensemble predictions (Eq. 31) is maximised.
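
A crude sketch of this design loop is given below: among a small set of candidate protocols, choose the one with the largest mean spread between ensemble predictions. The one-parameter "protocols" (characterised only by their duration) and the two-member ensemble are placeholders for whatever protocol parameterisation and fitted ensemble are available in practice.

    import numpy as np

    def predict_ensemble(duration):
        # Placeholder: one prediction per training protocol, on the candidate's grid
        t = np.linspace(0.0, duration, 200)
        return np.vstack([np.exp(-t), np.exp(-1.3 * t)])

    def mean_spread(duration):
        ensemble = predict_ensemble(duration)
        return (ensemble.max(axis=0) - ensemble.min(axis=0)).mean()

    candidate_durations = [1.0, 2.0, 5.0, 10.0]
    best = max(candidate_durations, key=mean_spread)   # candidate maximising the spread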

In this paper, we have applied our methodology to mathematical models of electrophysiology. In both of the cases we considered, we saw that model discrepancy could lead to inaccurate predictive models and, owing to the use of information-rich training protocols, the variability of our parameter estimates under repeated samples of the DGP was negligible. We propose that our methodology could be applied to similar problems, where high-frequency time-series data are available as well as mathematical models which are relatively accurate. There are many biochemical reaction networks for which model selection remains a challenge, and there are numerous approaches to finding suitable mathematical models (Klimovskaia et al. 2016). We propose that the methodology outlined in this paper may be used to quantify the discrepancy in such models.

In these examples, we saw little variability due to noise in our parameter inference, and therefore little variability in our spread-of-prediction intervals and midpoint predictions, as shown in Sects. 5.2 and 5.3. However, in other cases, where there are fewer observations and more observational noise (for example), it may be more suitable to consider an analogous distribution-based approach in which we consider Bayesian posteriors of our parameters instead of point estimates (such as the maximum likelihood estimator we used in this paper).
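
A minimal sketch of this distribution-based variant is given below, in which Gaussian draws about protocol-specific estimates stand in for genuine posterior samples and a one-parameter decay stands in for the model; in practice the draws would come from MCMC (or similar) applied to each protocol's data, and the functions here are illustrative placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins: Gaussian draws about protocol-specific estimates play the role
    # of posterior samples, and a one-parameter decay plays the role of the model.
    estimates = {"d1": 1.0, "d2": 1.3}

    def posterior_samples(d, n):
        return rng.normal(estimates[d], 0.05, n)

    def simulate(theta, t):
        return np.exp(-theta * t)

    t_validation = np.linspace(0, 5, 200)
    predictions = np.vstack([simulate(theta, t_validation)
                             for d in estimates
                             for theta in posterior_samples(d, 100)])
    b_lower, b_upper = predictions.min(axis=0), predictions.max(axis=0)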

4.3 Concluding Remarks

The spread of predictions of our ensembles, based on training with data from multiple experimental designs, provides a good indication of possible predictive error due to model discrepancy. Ultimately, whilst our ensemble approach is no substitute for a correctly specified model, it is a useful tool for quantifying model discrepancy and predicting the size and direction of its effects, and it may guide further experimental design and model selection approaches.